(54) A branch executing system and method

(57) In a data processing apparatus, a system for executing branches in single entry-single exit (SESE) basic blocks (BBs) contained within a program has means receiving the said program for determining a branch instruction within each basic block and for adding firing time information to the branch instruction. The firing time information identifies a time of execution of the branch instruction which is a variable number of instruction cycles prior to a time of execution of a last-to-be-executed instruction of the basic block. The system also has a processor operative on received non-branch instructions in each basic block for processing the instructions, and means operative on the received branch instruction in the basic block in response to the firing time information for completing the execution of said branch instruction no later than the same time as the processor is processing the last-to-be-executed non-branch instruction, so that the execution of the branch instruction occurs in parallel with the execution of the non-branch instructions, thereby speeding the overall processing of the program by the system.
Description

This invention generally relates to parallel processor computer systems and, more particularly, to parallel processor computer systems having software for detecting natural concurrencies in instruction streams and having a plurality of processor elements for processing the detected natural concurrencies.

Almost all prior art computer systems are of the "Von Neumann" construction. In fact, the first four generations of computers are Von Neumann machines which use a single large processor to sequentially process data. In recent years, considerable effort has been directed towards the creation of a fifth generation computer which is not of the Von Neumann type. One characteristic of the so-called fifth generation computer relates to its ability to perform parallel computation through use of a number of processor elements. With the advent of very large scale integration (VLSI) technology, the use of a number of individual processor elements becomes cost effective.

Whether or not an actual fifth generation machine has yet been constructed is subject to debate, but various features have been defined and classified. Fifth-generation machines should be capable of using multiple-instruction, multiple-data (MIMD) streams rather than simply being a single-instruction, multiple-data (SIMD) system typical of fourth generation machines. The present invention is of the fifth-generation non-Von Neumann type. It is capable of using MIMD streams in single context (SC-MIMD) or in multiple context (MC-MIMD) as those terms are defined below. The present invention also finds application in the entire computer classification of single and multiple context SIMD (SC-SIMD and MC-SIMD) machines as well as single and multiple context, single-instruction, single-data (SC-SISD and MC-SISD) machines.

While the design of fifth-generation computer systems is fully in a state of flux, certain categories of systems have been defined. Some workers in the field base the type of computer upon the manner in which "control" or "synchronization" of the system is performed. The control classification includes control-driven, data-driven, and reduction (or demand) driven. The control-driven system utilizes a centralized control such as a program counter or a master processor to control processing by the slave processors. An example of a control-driven machine is the Non-Von-1 machine at Columbia University. In data-driven systems, control of the system results from the actual arrival of data required for processing. An example of a data-driven machine is the University of Manchester dataflow machine developed in England by Ian Watson. Reduction driven systems control processing when the processed activity demands results to occur. An example of a reduction processor is the MAGO reduction machine being developed at the University of North Carolina, Chapel Hill. The characteristics of the Non-Von-1 machine, the Manchester machine, and the MAGO reduction machine are carefully discussed in Davis, "Computer Architecture," IEEE Spectrum, November 1983. In comparison, data-driven and demand-driven systems are decentralized approaches whereas control-driven systems represent a centralized approach. The present invention is more properly categorized in a fourth classification which could be termed "time-driven." Like data-driven and demand-driven systems, the control system of the present invention is decentralized. However, like the control-driven system, the present invention conducts processing when an activity is ready for execution.

Most computer systems involving parallel processing concepts have proliferated from a large number of different types of computer architectures. In such cases, the unique nature of the computer architecture requires either its own processing language or substantial modification of an existing language to be adapted for use. To take advantage of the highly parallel structure of such computer architectures, the programmer is required to have an intimate knowledge of the computer architecture in order to write the necessary software. As a result, preparing programs for these machines requires substantial amounts of the user's effort, money and time.

Concurrent with this activity, work has also been progressing on the creation of new software and languages, independent of a specific computer architecture, that will expose (in a more direct manner) the inherent parallelism of the computation process. However, most effort in designing supercomputers has been concentrated in developing new hardware with much less effort directed to developing new software.

Davis has speculated that the best approach to the design of a fifth-generation machine is to concentrate efforts on the mapping of the concurrent program tasks in the software onto the physical hardware resources of the computer architecture. Davis terms this approach one of "task-allocation" and touts it as being the ultimate key to successful fifth-generation architectures. He categorizes the allocation strategies into two generic types. "Static allocations" are performed once, prior to execution, whereas "dynamic allocations" are performed by the hardware whenever the program is executed or run. The present invention utilizes a static allocation strategy and provides task allocations for a given program after compilation and prior to execution. The recognition of the "task allocation" approach in the design of fifth generation machines was used by Davis in the design of his "Data-driven Machine-II" constructed at the University of Utah. In the Data-driven Machine-II, the program was compiled into a program graph that resembles the actual machine graph or architecture.

Task allocation is also referred to as "scheduling" in Gajski et al, "Essential Issues in Multi-processor Systems," Computer, June 1985. Gajski et al set forth levels of scheduling to include high level, intermediate level, and low level scheduling. The present invention is one of low-level scheduling, but it does not use conventional scheduling policies of "first-in-first-out", "round-robin", "shortest type in job-first", or "shortest-remaining-time." Gajski et al also recognize the advantage of static scheduling in that overhead costs are paid at compile time. However, Gajski et al's recognized disadvantage, with respect to static scheduling, of possible inefficiencies in guessing the run time profile of each task is not found in the present invention. Therefore, the conventional approaches to low-level static scheduling found in the Occam language and the Bulldog compiler are not found in the software portion of the present invention. Indeed, the low-level static scheduling of the present invention provides the same type, if not better, utilization of the processors commonly seen in dynamic scheduling by the machine at run time. Furthermore, the low-level static scheduling of the present invention is performed automatically without intervention of programmers as required (for example) in the Occam language.

Davis further recognizes that communication is a critical feature in concurrent processing in that the actual physical topology of the system significantly influences the overall performance of the system.

For example, the fundamental problem found in most data-flow machines is the large amount of communication overhead in moving data between the processors. When data is moved over a bus, significant overhead, and possible degradation of the system, can result if data must contend for access to the bus. For example, the Arvind data-flow machine, referenced in Davis, utilizes an I-structure stream in order to allow the data to remain in one place which then becomes accessible by all processors. The present invention, in one aspect, teaches a method of hardware and software based upon totally coupling the hardware resources thereby significantly simplifying the communication problems inherent in systems that perform multiprocessing.

Another feature of non-Von Neumann type multiprocessor systems is the level of granularity of the parallelism being processed. Gajski et al term this "partitioning." The goal in designing a system, according to Gajski et al, is to obtain as much parallelism as possible with the lowest amount of overhead. The present invention performs concurrent processing at the lowest level available, the "per instruction" level. The present invention, in another aspect, teaches a method whereby this level of parallelism is obtainable without execution time overhead.

Despite all of the work that has been done with multiprocessor parallel machines, Davis (Id. at 99) recognizes that such software and/or hardware approaches are primarily designed for individual tasks and are not universally suitable for all types of tasks or programs as has been the hallmark with Von Neumann architectures. The present invention sets forth a computer system and method that is generally suitable for many different types of tasks since it operates on the natural concurrencies existent in the instruction stream at a very fine level of granularity.

In IEEE TRANS. ON COMPUTERS, Vol. C-32, No. 5, May 1983, pages 425-438, J.E. Requa et al.: "The piecewise dataflow architecture: Architectural concepts", there is disclosed a single-program system which is data-flow-driven in nature. Instructions are provided with various items of dependency information which eventually allow various instruction timings to be derived.

SYSTEMS-COMPUTERS-CONTROLS, Vol. 14, No. 4, July-August 1984, pages 10-18, T. Higuchi et al.: "A method of functional distribution using Concurrent Pascal" discloses a parallel task processing system in which a common bus and processor elements are connected through a high-level function interface device containing a microprogrammable microprocessor. Task processing times are improved by functionally distributing portions of the tasks assigned to the processor elements to the interface device.

In McDOWELL, CH., "A simple architecture for low level parallelism", Proc. of the IEEE 1983 Int. Conference on Parallel Processing, pages 472-477, there is described a system in which all scheduling of operations to be executed in parallel is done at compile time. McDowell claims that this results in simpler hardware and increased performance, and that preliminary results show that significant speed-up is possible on both conventional sequential programs and on numerical array processing problems.

All general purpose computer systems and many special purpose computer systems have operating systems or monitor/control programs which support the processing of multiple activities or programs. In some cases this processing occurs simultaneously; in other cases the processing alternates among the activities such that only one activity controls the processing resources at any one time. This latter case is often referred to as time sharing, time slicing, or concurrent (versus simultaneous) execution, depending on the particular computer system. Also depending on the specific system, these individual activities or programs are usually referred to as tasks, processes, or contexts. In all cases, there is a method to support the switching of control among these various programs and between the programs and the operating system, which is usually referred to as task switching, process switching, or context switching. Throughout this document.

According to a first aspect of this invention, there is provided a parallel processor system for processing at least one stream of low level instructions, each said stream having a plurality of single entry-single exit basic blocks, the system comprising a plurality of individual processor elements, at least one logical resource driver for receiving said instructions, a plurality of shared storage resources, and means for connecting each of said processor elements with any one of said plurality of shared storage resources whereby each of said processor elements can access any one of said storage resources during said instruction processing; characterised in that the system is adapted to perform parallel processing of at least one conventional program of at least one user in a single or multiple context configuration, in that said at least one logical resource driver is adapted to store said instructions and to direct said instructions to selected processor elements at the correct execution time for the instruction; and further characterised in that the system further comprises means for connecting said plurality of processor elements with said logical resource drivers for transferring instructions from any said logical resource driver to any processor element.

The invention also includes, according to a second aspect thereof, a method for parallel processing at least one stream of low level instructions in a parallel processing system having a plurality of individual processor elements, at least one logical resource driver, a plurality of shared storage resources, and means for connecting each of said processor elements with any one of said plurality of shared storage resources, wherein each stream of instructions has a plurality of single entry-single exit basic blocks, the instructions being received by said logical resource driver and processed in said plurality of processor elements, characterised in that the method includes performing parallel processing of at least one conventional program of at least one user in a single or multiple context configuration, said instructions being stored in said logical resource driver and directed from said logical resource driver to selected processor elements at the correct execution time for the instruction, instructions being transferred from any said logical resource driver to any processor element.

In a preferred method in accordance with the invention, intelligence is added to each instruction stream in response to determination of natural concurrencies within the stream, the added intelligence representing at least an instruction firing time and a logical processor number for each instruction, and characterised in that in the step of processing the instructions each of said plurality of processor elements receives all instructions for that processor element in order according to the instruction firing times, the instruction having the earliest time being delivered first.
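By way of illustration only, the delivery order described above can be sketched as follows. The tuple layout (name, logical processor number, firing time), the function name, and the sample values are assumptions introduced for this sketch and are not taken from the specification.

```python
from collections import defaultdict


def deliver_in_firing_order(instructions):
    """Group instructions by logical processor number, earliest firing time first.

    `instructions` is an iterable of (name, logical_processor, firing_time)
    tuples; this layout is a hypothetical stand-in for the added intelligence.
    """
    queues = defaultdict(list)
    for name, logical_processor, firing_time in instructions:
        queues[logical_processor].append((firing_time, name))
    # Sorting each queue by firing time yields the per-processor delivery order.
    return {lpn: [name for _, name in sorted(queue)] for lpn, queue in queues.items()}


# Hypothetical instructions A-D with made-up processor numbers and firing times.
example = [("C", 0, 2), ("A", 0, 1), ("B", 1, 1), ("D", 0, 3)]
ordered = deliver_in_firing_order(example)
```

In this sketch, logical processor 0 would receive A, C and D in that order, and logical processor 1 would receive B.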

The description to follow, therefore, pertains to a non-Von Neumann MIMD computer system capable of simultaneously operating upon many different and conventional programs by one or more different users. The natural concurrencies in each program are statically allocated, at a very fine level of granularity, and intelligence is added to the instruction stream at essentially the object code level. The added intelligence can include, for example, a logical processor number and an instruction firing time in order to provide the time-driven decentralized control for the present invention. The detection and low level scheduling of the natural concurrencies and the adding of the intelligence occurs only once for a given program, after conventional compiling of the program, without user intervention and prior to execution. The results of this static allocation are executed on a system containing a plurality of processor elements. In one embodiment of the invention, the processors are identical. The processor elements, in this illustrated embodiment, contain no execution state information from the execution of previous instructions, that is, they are context free. In addition, a plurality of context files, one for each user, are provided wherein the plurality of processor elements can access any storage resource contained in any context file through total coupling of the processor element to the shared resource during the processing of an instruction. In a preferred aspect of the present invention, no condition code or results registers are found on the individual processor elements.

The disclosed method and system are adaptable for use in single or multiple context SISD, SIMD, and MIMD configurations and may be further operative upon a myriad of conventional programs without user intervention.

In one aspect, the system described below statically determines, at a very fine level of granularity, the natural concurrencies in the basic blocks (BBs) of programs at essentially the object code level and adds intelligence to the instruction stream in each basic block to provide a time driven decentralized control. The detection and low level scheduling of the natural concurrencies and the addition of the intelligence occurs only once for a given program after conventional compiling and prior to execution. At this time, prior to program execution, the use during later execution of all instruction resources is assigned.

In another aspect, the described system further executes the basic blocks containing the added intelligence on a system containing a plurality of processor elements each of which, in this particular embodiment, does not retain execution state information from prior operations. Hence, all processor elements in accordance with this embodiment of the invention are context free. Instructions are selected for execution based on the instruction firing time. Each processor element in this embodiment is capable of executing instructions on a per-instruction basis such that dependent instructions can execute on the same or different processor elements. A given processor element in the present invention is capable of executing an instruction from one context followed by an instruction from another context. All operating and context information necessary for processing a given instruction is then contained elsewhere in the system.

It should be noted that many alternative implementations of context free processor elements are possible. In a non-pipelined implementation each processor element is monolithic and executes a single instruction to its completion prior to accepting another instruction.

In another aspect of the described system, the context free processor is a pipelined processor element, in which each instruction requires several machine instruction clock cycles to complete. In general, during each clock cycle, a new instruction enters the pipeline and a completed instruction exits the pipeline, giving an effective instruction execution time of a single instruction clock cycle. However, it is also possible to microcode some instructions to perform complicated functions requiring many machine instruction cycles. In such cases the entry of new instructions is suspended until the complex instruction completes, after which the normal instruction entry and exit sequence in each clock cycle continues. Pipelining is a standard processor implementation technique and is discussed in more detail later.

A preferred system and method in accordance with the present invention are described below by way of example 
with reference to the drawings in which: 

FIGURE 1 is the generalized flow representation of TOLL software;
FIGURE 2 is a graphic representation of a sequential series of basic blocks found within a conventional compiler output;
FIGURE 3 is a graphical presentation of the extended intelligence added to each basic block according to one embodiment of the present invention;
FIGURE 4 is a graphical representation showing the details of the extended intelligence added to each instruction within a given basic block according to one embodiment of the present invention;
FIGURE 5 is the breakdown of the basic blocks into discrete execution sets;
FIGURE 6 is a block diagram presentation of the architectural structure of apparatus according to a preferred embodiment of the present invention;
FIGURES 7a - 7c represent an illustration of the network interconnections during three successive instruction firing times;
FIGURES 8 - 11 are the flow diagrams setting forth features of the software according to one embodiment of the present invention;
FIGURE 12 is a diagram describing one preferred form of the execution sets in the TOLL software;
FIGURE 13 sets forth the register file organization according to a preferred embodiment of the present invention;
FIGURE 14 illustrates a transfer between registers in different levels during a subroutine call;
FIGURE 15 sets forth the structure of a logical resource driver (LRD) according to a preferred embodiment of the present invention;
FIGURE 16 sets forth the structure of an instruction cache control and of the caches according to a preferred embodiment of the present invention;
FIGURE 17 sets forth the structure of a PIQ buffer unit and a PIQ bus interface unit according to a preferred embodiment of the present invention;
FIGURE 18 sets forth interconnection of processor elements through the PE-LRD network to a PIQ processor alignment circuit according to a preferred embodiment of the present invention;
FIGURE 19 sets forth the structure of a branch execution unit according to a preferred embodiment of the present invention;
FIGURE 20 illustrates the organization of the condition code storage of a context file according to a preferred embodiment of the present invention;
FIGURE 21 sets forth the structure of one embodiment of a pipelined processor element; and
FIGURES 22(a) through 22(d) set forth the data structures used in connection with the processor element of Figure 21.

GENERAL DESCRIPTION 
1. Introduction

In the following two sections, a general description of the software and hardware of the preferred system is presented. The system is designed based upon a unique relationship between the hardware and software components. While many prior art approaches have primarily provided for multiprocessor parallel processing based upon a new architecture design or upon unique software algorithms, the present system is based upon a unique hardware/software relationship. The software provides the intelligent information for the routing and synchronization of the instruction streams through the hardware. In the performance of these tasks, the software spatially and temporally manages all user accessible resources, for example, general registers, condition code storage registers, memory and stack pointers. The routing and synchronization are performed without user intervention, and do not require changes to the original source code. Additionally, the analysis of an instruction stream to provide the additional intelligent information for controlling the routing and synchronization of the instruction stream is performed only once during the program preparation process (often called "static allocation") of a given piece of software, and is not performed during execution (often called "dynamic allocation") as is found in some conventional prior art approaches. The analysis effected is hardware dependent, is performed on the object code output from conventional compilers, and advantageously, is therefore programming language independent.

In other words, the software maps the object code program onto the hardware of the system so that it executes more efficiently than is typical of prior art systems. Thus the software must handle all hardware idiosyncrasies and their effects on execution of the program instruction stream. For example, the software must accommodate, when necessary, processor elements which are either monolithic single cycle or pipelined.

EP 0 840 213 A2

2. General Software Description 

Referring to Figure 1, the software, generally termed "TOLL," is located in a computer processing system 160. Processing system 160 operates on a standard compiler output 100 which is typically object code or an intermediate object code such as "p-code." The output of a conventional compiler is a sequential stream of object code instructions hereinafter referred to as the instruction stream. Conventional language processors typically perform the following functions in generating the sequential instruction stream:

1. lexical scan of the input text,
2. syntactical scan of the condensed input text including symbol table construction,
3. performance of machine independent optimization including parallelism detection and vectorization, and
4. an intermediate (PSEUDO) code generation taking into account instruction functionality, resources required, and hardware structural properties.

In the creation of the sequential instruction stream, the conventional compiler creates a series of basic blocks (BBs) which are single entry single exit (SESE) groups of contiguous instructions. See, for example, Alfred V. Aho and Jeffrey D. Ullman, Principles of Compiler Design, Addison Wesley, 1979, pg. 6, 409, 412-413 and David Gries, Compiler Construction for Digital Computers, Wiley, 1971. The conventional compiler, although it utilizes basic block information in the performance of its tasks, provides an output stream of sequential instructions without any basic block designations. The TOLL software, in this illustrated embodiment of the present invention, is designed to operate on the formed basic blocks (BBs) which are created within a conventional compiler. In each of the conventional SESE basic blocks there is exactly one branch (at the end of the block) and there are no control dependencies. The only relevant dependencies within the block are those between the resources required by the instructions.
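The basic block structure just described can be illustrated by a minimal sketch, offered under stated assumptions and not as the compiler's or the TOLL software's actual procedure. The function name, the `BRANCH_OPCODES` set, and the use of list indices to mark branch targets are all hypothetical.

```python
# Only BRNZR appears in this description; the set is an assumption and
# would need extending for any real instruction set.
BRANCH_OPCODES = {"BRNZR"}


def split_basic_blocks(stream, branch_targets=()):
    """Split a list of instruction strings into single entry-single exit blocks.

    A block is closed after each branch (single exit), and a new block is
    opened at every index listed in `branch_targets` (single entry).
    """
    blocks, current = [], []
    targets = set(branch_targets)
    for index, text in enumerate(stream):
        if index in targets and current:
            blocks.append(current)
            current = []
        current.append(text)
        if text.split()[0] in BRANCH_OPCODES:
            blocks.append(current)
            current = []
    if current:  # trailing instructions with no closing branch
        blocks.append(current)
    return blocks


stream = ["LD R0, (R10)+", "DEC R4", "BRNZR LOOP", "ADD R2, R3, R3"]
blocks = split_basic_blocks(stream)
```

Here the first three instructions form one block ending in the branch, and the remaining instruction begins the next block.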

The output of the compiler 100 in the basic block format is illustrated in Figure 2. Referring to Figure 1, the TOLL software 110 being processed in the computer 160 performs three basic determining functions on the compiler output 100. These functions are to analyze the resource usage of the instructions 120, to extend intelligence for each instruction in each basic block 130, and to build execution sets composed of one or more basic blocks 140. The resulting output of these three basic functions 120, 130, and 140 from processor 160 is the TOLL software output 150 of the present invention.

As noted above, the TOLL software operates on a compiler output 100 only once and without user intervention. 
Therefore, for any given program, the TOLL software need operate on the compiler output 100 only once. 

The functions 120, 130, 140 of the TOLL software 110 are, for example, to analyze the instruction stream in each basic block for natural concurrencies, to perform a translation of the instruction stream onto the actual hardware system of the present invention, to alleviate any hardware induced idiosyncrasies that may result from the translation process, and to encode the resulting instruction stream into an actual machine language to be used with the hardware of the present system. The TOLL software 110 performs these functions by analyzing the instruction stream and then assigning processor elements and resources as a result thereof. In one particular embodiment, the processors are context free. The TOLL software 110 provides the "synchronization" of the overall system by, for example, assigning appropriate firing times to each instruction in the output instruction stream.

Instructions can be dependent on one another in a variety of ways although there are only three basic types of dependencies. First, there are procedural dependencies due to the actual structure of the instruction stream; that is, instructions may follow one another in other than a sequential order due to branches, jumps, etc. Second, operational dependencies are due to the finite number of hardware elements present in the system. These hardware elements include the general registers, condition code storage, stack pointers, processor elements, and memory. Thus if two instructions are to execute in parallel, they must not require the same hardware element unless they are both reading that element (provided, of course, that the element is capable of being read simultaneously). Finally, there are data dependencies between instructions in the instruction stream. This form of dependency will be discussed at length later and is particularly important if the processor elements include pipelined processors. Within a basic block, however, only data and operational dependencies are present.
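The rule that two instructions may share a hardware element only when both merely read it can be expressed as a short test. The following is a sketch under stated assumptions: the dictionary layout with "reads" and "writes" sets, the function name, and the sample instructions x, y and z are hypothetical and serve only to illustrate the rule.

```python
def naturally_concurrent(a, b):
    """Return True if instructions a and b share no element that either writes.

    Each instruction is a dict with 'reads' and 'writes' sets of resource
    names (registers, condition code storage, and so on). Elements that both
    instructions only read do not create a conflict.
    """
    write_conflict = a["writes"] & (b["reads"] | b["writes"])
    read_conflict = b["writes"] & a["reads"]
    return not (write_conflict or read_conflict)


# Hypothetical instructions, not taken from the specification.
x = {"reads": {"R5"}, "writes": {"R6"}}
y = {"reads": {"R5"}, "writes": {"R7"}}  # shares R5 with x, read only
z = {"reads": {"R6"}, "writes": {"R8"}}  # reads the element x writes

x_and_y = naturally_concurrent(x, y)
x_and_z = naturally_concurrent(x, z)
```

In this sketch x and y may execute in parallel because R5 is only read by both, whereas x and z may not because z reads the register that x writes.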

The TOLL software 110 must maintain the proper execution of a program. Thus, the TOLL software must assure that the code output 150, which represents instructions which will execute in parallel, generates the same results as those of the original serial code. To do this, the code 150 must access the resources in the same relative sequence as the serial code for instructions that are dependent on one another; that is, the relative ordering must be satisfied. However, independent sets of instructions may be effectively executed out of sequence.
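The relative ordering requirement can be checked mechanically once firing times are known. The sketch below is a checker only and does not purport to show how the TOLL software assigns firing times; the function name, the representation of dependencies as (earlier, later) pairs, and the assumption that a dependent instruction must fire at a strictly later time are all introduced here for illustration.

```python
def preserves_relative_order(firing_time, dependencies):
    """Check that every dependent instruction fires after the one it depends on.

    `firing_time` maps instruction name -> integer firing time;
    `dependencies` is an iterable of (earlier, later) pairs in serial order.
    Strictly-later firing is an assumption of this sketch.
    """
    return all(firing_time[later] > firing_time[earlier]
               for earlier, later in dependencies)


# Hypothetical example: C depends on both A and B; A and B are independent.
deps = [("A", "C"), ("B", "C")]
valid = preserves_relative_order({"A": 1, "B": 1, "C": 2}, deps)
invalid = preserves_relative_order({"A": 1, "B": 2, "C": 2}, deps)
```

The first schedule fires the independent instructions A and B together and C afterwards, so the ordering is satisfied; the second fires C at the same time as B and is rejected.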

In Table 1 is set forth an example of a SESE basic block representing the inner loop of a matrix multiply routine. While this example will be used throughout this specification, the teachings of the present description are applicable to any instruction stream. Referring to Table 1, the instruction designation is set forth in the right hand column and a conventional object code functional representation, for this basic block, is represented in the left hand column.



TABLE 1

OBJECT CODE         INSTRUCTION
LD R0, (R10)+       I0
LD R1, (R11)+       I1
MM R0, R1, R2       I2
ADD R2, R3, R3      I3
DEC R4              I4
BRNZR LOOP          I5



The instruction stream contained within the SESE basic block set forth in Table 1 performs the following functions. In instruction I0, register R0 is loaded with the contents of memory whose address is contained in R10. The instruction shown above increments the contents of R10 after the address has been fetched from R10. The same statement can be made for instruction I1, with the exception that register R1 is loaded and register R11 is incremented. Instruction I2 causes the contents of registers R0 and R1 to be multiplied and the result is stored in register R2. In instruction I3, the contents of register R2 and register R3 are added and the result is stored in register R3. In instruction I4, register R4 is decremented. Instructions I2, I3 and I4 also generate a set of condition codes that reflect the status of their respective execution. In instruction I5, the contents of register R4 are indirectly tested for zero (via the condition codes generated by instruction I4). A branch occurs if the decrement operation produced a non-zero value; otherwise execution proceeds with the first instruction of the next basic block.

Referring to Figure 1, the first function performed by the TOLL software 110 is to analyze the resource usage of the
instructions. In the illustrated example, these are instructions I0 through I5 of Table 1. The TOLL software 110 thus
analyzes each instruction to ascertain the resource requirements of the instruction.

This analysis is important in determining whether or not any resources are shared by any instructions and,
therefore, whether or not the instructions are independent of one another. Clearly, mutually independent instructions can be
executed in parallel and are termed "naturally concurrent." Independent instructions do not rely on one another for any
information, nor do they share any hardware resources in other than a read only manner.

On the other hand, instructions that are dependent on one another can be formed into a set wherein each
instruction in the set is dependent on every other instruction in that set. The dependency may not be direct. The set can be
described by the instructions within the set, or conversely, by the resources used by the instructions in the set.
Instructions within different sets are completely independent of one another, that is, there are no resources shared by the sets.
Hence, the sets are independent of one another.

In the example of Table 1, the TOLL software will determine that there are two independent sets of dependent
instructions:

Set 1:   CC1:   I0, I1, I2, I3
Set 2:   CC2:   I4, I5



As can be seen, instructions I4 and I5 are independent of instructions I0 - I3. In set 2, I5 is directly dependent on I4. In
set 1, I2 is directly dependent on I0 and I1. Instruction I3 is directly dependent on I2 and indirectly dependent on I0 and
I1.

The TOLL software detects these independent sets of dependent instructions and assigns a condition code group
designation, such as CC1 or CC2, to each set. This avoids the operational dependency that would occur if only
one group or set of condition codes were available to the instruction stream.

In other words, the results of the execution of instructions I0 and I1 are needed for the execution of instruction I2.
Similarly, the results of the execution of instruction I2 are needed for the execution of instruction I3. In performing this
analysis, the TOLL software 110 determines if an instruction will perform a read and/or a write to a resource. This
functionality is termed the resource requirement analysis of the instruction stream.

It should be noted that, unlike the teachings of the prior art, it is not necessary for dependent instructions to execute
on the same processor element. The determination of dependencies is needed only to determine condition code sets
and to determine instruction firing times, as will be described later. The present system can execute dependent
instructions on different processor elements, in one illustrated embodiment, because of the context free nature of the
processor elements and the total coupling of the processor elements to the shared resources, such as the register files, as will
also be described below.

The results of the analysis stage 120, for the example set forth in Table 1, are set forth in Table 2.



TABLE 2

INSTRUCTION   FUNCTION
I0            Memory Read, Reg. Write, Reg. Read & Write
I1            Memory Read, Reg. Write, Reg. Read & Write
I2            Two Reg. Reads, Reg. Write, Set Cond. Code (Set #1)
I3            Two Reg. Reads, Reg. Write, Set Cond. Code (Set #1)
I4            Read Reg., Reg. Write, Set Cond. Code (Set #2)
I5            Read Cond. Code (Set #2)



In Table 2, for instructions I0 and I1, a register is read and written followed by a memory read (at a distinct address),
followed by a register write. Likewise, condition code writes and register reads and writes occur for instructions I2
through I4. Finally, instruction I5 is a simple read of a condition code storage register and a resulting branch or loop.

The second step or pass 130 through the SESE basic block 100 is to add or extend intelligence to each instruction
within the basic block. In the preferred embodiment of the invention, this is the assignment of an instruction's execution
time relative to the execution times of the other instructions in the stream, the assignment of a processor number on
which the instruction is to execute and the assignment of any so-called static shared context storage mapping
information that may be needed by the instruction.

In order to assign the firing time to an instruction, the temporal usage of each resource required by the instruction
must be considered. In the illustrated embodiment, the temporal usage of each resource is characterized by a "free
time" and a "load time." The free time is the last time the resource was read or written by an instruction. The load time
is the last time the resource was modified by an instruction. If an instruction is going to modify a resource, it must
execute the modification after the last time the resource was used, in other words, after the free time. If an instruction is
going to read the resource, it must perform the read after the last time the resource has been loaded, in other words,
after the load time.

The relationship between the temporal usage of each resource and the actual usage of the resource is as follows.
If an instruction is going to write/modify the resource, the last time the resource is read or written by other instructions
(i.e., the "free time" for the resource) plus one time interval will be the earliest firing time for this instruction. The "plus
one time interval" comes from the fact that an instruction is still using the resource during the free time. On the other
hand, if the instruction reads a resource, the last time the resource is modified by other instructions (i.e., the load time
for the resource) plus one time interval will be the earliest instruction firing time. The "plus one time interval" comes from
the time required for the instruction that is performing the load to execute.

The discussion above assumes that the exact location of the resource that is accessed is known. This is always
true of resources that are directly named such as general registers and condition code storage. However, memory
operations may, in general, be to locations unknown at compile time. In particular, addresses that are generated by effective
addressing constructs fall in this class. In the previous example, it has been assumed (for the purposes of
communicating the basic concepts of TOLL) that the addresses used by instructions I0 and I1 are distinct. If this were not the case,
the TOLL software would assure that only those instructions that did not use memory would be allowed to execute in
parallel with an instruction that was accessing an unknown location in memory.

The instruction firing time is evaluated by the TOLL software 110 for each resource that the instruction uses. These
"candidate" firing times are then compared to determine which is the largest or latest time. The latest time determines
the actual firing time assigned to the instruction. At this point, the TOLL software 110 updates all of the resources' free
and load times to reflect the firing time assigned to the instruction. The TOLL software 110 then proceeds to analyze
the next instruction.
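
The free time and load time rule described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the TOLL implementation: the `INSTRUCTIONS` encoding, the function name `assign_firing_times`, and the choice to initialize every resource's free and load times to the interval just before the block starts are all hypothetical. Memory accesses are left out, on the same assumption the text makes that the addresses used by I0 and I1 are distinct.

```python
# Hypothetical encoding of Table 1: (name, resources read, resources written).
INSTRUCTIONS = [
    ("I0", ["R10"], ["R0", "R10"]),        # LD R0, (R10)+
    ("I1", ["R11"], ["R1", "R11"]),        # LD R1, (R11)+
    ("I2", ["R0", "R1"], ["R2", "CC1"]),   # MM R0, R1, R2
    ("I3", ["R2", "R3"], ["R3", "CC1"]),   # ADD R2, R3, R3
    ("I4", ["R4"], ["R4", "CC2"]),         # DEC R4
    ("I5", ["CC2"], []),                   # BRNZR LOOP
]


def assign_firing_times(instructions, block_start):
    """Return {instruction name: firing time} using free and load times."""
    free = {}   # last time each resource was read or written
    load = {}   # last time each resource was modified
    before = block_start - 1  # assumed initial free/load time of every resource
    firing = {}
    for name, reads, writes in instructions:
        candidates = [block_start]
        # A read must follow the last modification of the resource.
        candidates += [load.get(r, before) + 1 for r in reads]
        # A write must follow the last read or write of the resource.
        candidates += [free.get(r, before) + 1 for r in writes]
        ift = max(candidates)  # the latest candidate becomes the firing time
        firing[name] = ift
        # Update free and load times to reflect the assigned firing time.
        for r in reads:
            free[r] = max(free.get(r, before), ift)
        for r in writes:
            free[r] = max(free.get(r, before), ift)
            load[r] = max(load.get(r, before), ift)
    return firing


if __name__ == "__main__":
    for name, t in assign_firing_times(INSTRUCTIONS, 16).items():
        print(f"{name}: T{t}")
```

With the block assumed to start at T16, this sketch assigns T16, T16, T17, T18, T16 and T17 to I0 through I5.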

There are many methods available for determining inter-instruction dependencies within a basic block. The
previous discussion is just one possible implementation assuming a specific compiler-TOLL partitioning. Many other
compiler-TOLL partitionings and methods for determining inter-instruction dependencies may be possible and realizable to
one skilled in the art. The illustrated TOLL software uses a linked list analysis to represent the data dependencies
within a basic block. Other possible data structures that could be used are trees, stacks, etc.

Assume a linked list representation is used for the analysis and representation of the inter-instruction
dependencies. Each register is associated with a set of pointers to the instructions that use the value contained in that register.
For the matrix multiply example in Table 1, the resource usage is set forth in Table 3:



TABLE 3

Resource   Loaded By   Read By
R0         I0          I2
R1         I1          I2
R2         I2          I3
R3         I3          I3, I2
R4         I4          I5
R10        I0          I0
R11        I1          I1



Thus, by following the "Read By" links and knowing the resource utilization for each instruction, the independencies of
Sets 1 and 2, above, are constructed in the analyze instruction stage 120 (Figure 1) by TOLL 110.
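
As a rough illustration of following the "Loaded By" and "Read By" links, the Python sketch below groups the instructions of Table 3 into connected sets. It uses a small union-find structure in place of the linked list the text describes, purely for brevity; `TABLE_3` and `dependent_sets` are hypothetical names, and the table contents are copied from Table 3 as given.

```python
# Table 3 as given in the text: resource -> (loaded by, read by).
TABLE_3 = {
    "R0":  (["I0"], ["I2"]),
    "R1":  (["I1"], ["I2"]),
    "R2":  (["I2"], ["I3"]),
    "R3":  (["I3"], ["I3", "I2"]),
    "R4":  (["I4"], ["I5"]),
    "R10": (["I0"], ["I0"]),
    "R11": (["I1"], ["I1"]),
}


def dependent_sets(table):
    """Group instructions connected through Loaded By -> Read By links."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Every producer of a resource is linked to every consumer of it.
    for loaders, readers in table.values():
        for producer in loaders:
            for consumer in readers:
                union(producer, consumer)

    groups = {}
    for instr in list(parent):
        groups.setdefault(find(instr), []).append(instr)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    for number, group in enumerate(dependent_sets(TABLE_3), start=1):
        print(f"Set {number}: {', '.join(group)}")
```

For the Table 3 data this yields two groups, I0 through I3 and I4 with I5, which correspond to Sets 1 and 2.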
For purposes of analyzing further the example of Table 1, it is assumed that the basic block commences with an
arbitrary time interval in an instruction stream, such as, for example, time interval T16. In other words, this particular
basic block in time sequence is assumed to start with time interval T16. The results of the analysis in stage 120 are set
forth in Table 4.

TABLE 4

REG    I0     I1     I2     I3     I4     I5
R0     T16           T17
R1            T16    T17
R2                   T17    T18
R3                          T18
R4                                 T16
CC1                  T17    T18
CC2                                       T17
R10    T16
R11           T16











The vertical direction in Table 4 represents the general registers and condition code storage registers. The horizontal
direction in the table represents the instructions in the basic block example of Table 1. The entries in the table represent
usage of a register by an instruction. Thus, instruction I0 requires that register R10 be read and written and register R0
written at time T16, the start of execution of the basic block.

Under the teachings of the present invention, there is no reason that registers R1, R11, and R4 cannot also have
operations performed on them during time T16. The three instructions, I0, I1, and I4, are data independent of each
other and can be executed concurrently during time T16. Instruction I2, however, requires first that registers R0 and R1
be loaded so that the results of the load operation can be multiplied. The results of the multiplication are stored in
register R2. Although register R2 could in theory be operated on in time T16, instruction I2 is data dependent upon the
results of loading registers R0 and R1, which occurs during time T16. Therefore, the completion of instruction I2 must
occur during or after time frame T17. Hence, in Table 4 above, the entry T17 for the intersection of instruction I2 and
register R2 is underlined because it is data dependent. Likewise, instruction I3 requires data in register R2 which first
occurs during time T17. Hence, instruction I3 can operate on register R2 only during or after time T18. Instruction I5
depends upon the reading of the condition code storage CC2 which is updated by instruction I4. The reading of the
condition code storage CC2 is data dependent upon the results stored in time T16 and, therefore, must occur during or after
the next time, T17.

Hence, in stage 130, the object code instructions are assigned "instruction firing times" (IFTs) as set forth in Table 
5 based upon the above analysis. 



TABLE 5

OBJECT CODE INSTRUCTION   INSTRUCTION FIRING TIME (IFT)
I0                        T16
I1                        T16
I2                        T17
I3                        T18
I4                        T16
I5                        T17



Each of the instructions in the sequential instruction stream in a basic block can be performed in the assigned time inter- 
vals. As is clear in Table 5, the same six instructions of Table 1, normally processed sequentially in six cycles, can be 
processed, under the teachings of the present invention, in only three firing times: T16, T17, and T18. The instruction 
firing time (IFT) provides the "time-driven" feature of the present system. 

The next function performed by stage 130, in the illustrated embodiment, is to reorder the natural concurrencies in
the instruction stream according to instruction firing times (IFTs) and then to assign the instructions to the individual log- 
ical parallel processors. It should be noted that the reordering is only required due to limitations in currently available 
technology. If true fully associative memories were available, the reordering of the stream would not be required and the 
processor numbers could be assigned in a first come, first served manner. The hardware of the instruction selection 
mechanism could be appropriately modified by one skilled in the art to address this mode of operation. 

For example, assuming currently available technology, and a system with four parallel processor elements (PEs) 
and a branch execution unit (BEU) within each LRD, the processor elements and the branch execution unit can be 
assigned as set forth in Table 6 below. It should be noted that the processor elements execute all non-branch instruc- 
tions, while the branch execution unit (BEU) of the present invention executes all branch instructions. These hardware 
circuitries will be described in greater detail subsequently. 



TABLE 6

Logical Processor Number   T16   T17          T18
0                          I0    I2           I3
1                          I1
2                          I4
3
BEU                              I5 (delay)





Hence, during time interval T16, parallel processor elements 0, 1, and 2 concurrently process instructions I0, I1, and I4
respectively. Likewise, during the next time interval T17, parallel processor element 0 and the BEU concurrently
process instructions I2 and I5 respectively. And finally, during time interval T18, processor element 0 processes instruction
I3. During instruction firing times T16, T17, and T18, parallel processor element 3 is not utilized in the example of Table
1. In actuality, since the last instruction is a branch instruction, the branch cannot occur until the last processing is
finished in time T18 for instruction I3. A delay field is built into the processing of instruction I5 so that even though it is
processed in time interval T17 (the earliest possible time), its execution is delayed so that looping or branching out occurs
after instruction I3 has executed.
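
The reordering by firing time and the assignment of logical processor numbers can be sketched as follows. This is only one plausible reading of the step: `assign_processors` and `branch_delay` are hypothetical names, the lowest-number-first policy is an assumption, and the delay is computed here as the gap between the branch's firing time and the latest firing time in the block, which the text does not spell out.

```python
# Firing times from Table 5; I5 is the block's branch instruction.
FIRING_TIMES = {"I0": 16, "I1": 16, "I2": 17, "I3": 18, "I4": 16, "I5": 17}
BRANCHES = {"I5"}


def assign_processors(firing_times, branches, num_pes=4):
    """Reorder by firing time and hand out logical processor numbers."""
    schedule = {}    # firing time -> {unit: instruction}
    next_free = {}   # firing time -> next unused logical processor number
    # sorted() is stable, so program order is kept within one firing time.
    for name in sorted(firing_times, key=firing_times.get):
        t = firing_times[name]
        slot = schedule.setdefault(t, {})
        if name in branches:
            slot["BEU"] = name  # branches go to the branch execution unit
            continue
        lpn = next_free.get(t, 0)
        if lpn >= num_pes:
            # Assumption: this sketch does not defer overflowing instructions.
            raise ValueError(f"more than {num_pes} instructions at T{t}")
        slot[lpn] = name
        next_free[t] = lpn + 1
    return schedule


def branch_delay(firing_times, branch):
    """Intervals the branch waits so it takes effect after the last instruction."""
    return max(firing_times.values()) - firing_times[branch]


if __name__ == "__main__":
    for t, slot in assign_processors(FIRING_TIMES, BRANCHES).items():
        print(f"T{t}: {slot}")
    print("I5 delay:", branch_delay(FIRING_TIMES, "I5"))
```

Under these assumptions the sketch places I0, I1 and I4 on logical processors 0, 1 and 2 at T16, I2 on processor 0 and I5 on the BEU at T17, and I3 on processor 0 at T18, with a delay of one interval for I5.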

In summary, the TOLL software 110 of the present illustrated embodiment, in stage 130, examines each individual
instruction and its resource usage both as to type and as to location (if known) (e.g., Table 3). It then assigns instruction
firing times (IFTs) on the basis of this resource usage (e.g., Table 4), reorders the instruction stream based upon these
firing times (e.g., Table 5) and assigns logical processor numbers (LPNs) (e.g., Table 6) as a result thereof.

The extended intelligence information involving the logical processor number (LPN) and the instruction firing time
(IFT) is, in the illustrated embodiment, added to each instruction of the basic block as shown in Figures 3 and 4. As will
also be pointed out subsequently, the extended intelligence (EXT) for each instruction in a basic block (BB) will be
correlated with the actual physical processor architecture of the present system. The correlation is performed by the
system hardware. It is important to note that the actual hardware may contain fewer, the same number of, or more physical
processor elements than the number of logical processor elements.

The Shared Context Storage Mapping (SCSM) information in Figure 4, attached to each instruction in this
illustrated and preferred embodiment of the invention, has a static and a dynamic component. The static component of the
SCSM information is attached by the TOLL software or compiler and is a result of the static analysis of the instruction
stream. Dynamic information is attached at execution time by a logical resource driver (LRD) as will be discussed later.

At this stage 130, the illustrated TOLL software 110 has analyzed the instruction stream as a set of single entry
single exit (SESE) basic blocks (BBs) for natural concurrencies that can be processed individually by separate processor
elements (PEs) and has assigned to each instruction an instruction firing time (IFT) and a logical processor number
(LPN). The instruction stream is thus pre-processed by the TOLL software to statically allocate all processing resources
in advance of execution. This is done once for any given program and is applicable to any one of a number of different
program languages such as FORTRAN, COBOL, PASCAL, BASIC, etc.

Referring to Figure 5, a series of basic blocks (BBs) can form a single execution set (ES) and in stage 140, the
TOLL software 110 builds such execution sets (ESs). Once the TOLL software identifies an execution set 500, header
510 and/or trailer 520 information is added at the beginning and/or end of the set. In the preferred embodiment, only
header information 510 is attached at the beginning of the set, although the invention is not so limited.

Basic blocks generally follow one another in the instruction stream. There may be no need for reordering of the
basic blocks even though individual instructions within a basic block, as discussed above, are reordered and assigned
extended intelligence information. However, the invention is not so limited. Each basic block is single entry and single
exit (SESE) with the exit through a branch instruction. Typically, the branch to another instruction is within a localized
neighborhood such as within 400 instructions of the branch. The purpose of forming the execution sets (stage 140) is
to determine the minimum number of basic blocks that can exist within an execution set such that the number of
"instruction cache faults" is minimized. In other words, in a given execution set, branches or transfers out of an
execution set are statistically minimized. The TOLL software in stage 140 can use a number of conventional techniques for
solving this linear programming-like problem, a problem which is based upon branch distances and the like. The
purpose is to define an execution set as set forth in Figure 5 so that the execution set can be placed in a hardware cache,
as will be discussed subsequently, to minimize instruction cache faults (i.e., transfers out of the execution set).

What has been set forth above is an example, illustrated using Tables 1 through 6, of the TOLL software 110 in a
single context application. In essence, the TOLL software determines the natural concurrencies within the instruction
streams for each basic block within a given program. The TOLL software adds, in the illustrated embodiment, an
instruction firing time (IFT) and a logical processor number (LPN) to each instruction in accordance with the determined
natural concurrencies. All processing resources are statically allocated in advance of processing. The TOLL software of
the present system can be used in connection with a number of simultaneously executing different programs, each
program being used by the same or different users on a processing system, as will be described and explained below.

3. General Hardware Description

Referring to Figure 6, the block diagram format of the system architecture of the present
system, termed the TDA system architecture 600, includes a memory sub-system 610 interconnected to a plurality of
logical resource drivers (LRDs) 620 over a network 630. The logical resource drivers 620 are further interconnected to
a plurality of processor elements 640 over a network 650. Finally, the plurality of processor elements 640 are
interconnected over a network 670 to the shared resources containing a pool of register set and condition code set files 660.

The LRD-memory network 630, the PE-LRD network 650, and the PE-context file network 670 are full access networks
that could be composed of conventional crossbar networks, omega networks, banyan networks, or the like. The
networks are full access (non-blocking in space) so that, for example, any processor element 640 can access any register
file or condition code storage in any context (as defined hereinbelow) file 660. Likewise, any processor element 640 can
access any logical resource driver 620 and any logical resource driver 620 can access any portion of the memory
subsystem 610. In addition, the PE-LRD and PE-context file networks are non-blocking in time. In other words, these two
networks guarantee access to any resource from any resource regardless of load conditions on the network. The
architecture of the switching elements of the PE-LRD network 650 and the PE-context file network 670 is considerably
simplified since the TOLL software guarantees that collisions in the network will never occur. The diagram of Figure 6
represents an MIMD system wherein each context file 660 corresponds to at least one user program.

The memory subsystem 610 can be constructed using a conventional memory architecture and conventional
memory elements. There are many such architectures and elements that could be employed by a person skilled in the art
and which would satisfy the requirements of this system. For example, a banked memory architecture could be used.
(High Speed Memory Systems, A.V. Pohm and O.P. Agrawal, Reston Publishing Co., 1983.)

The logical resource drivers 620 are unique to the system architecture 600 of the present system. Each illustrated
LRD provides the data cache and instruction selection support for a single user (who is assigned a context file) on a
timeshared basis. The LRDs receive execution sets from the various users wherein one or more execution sets for a
context are stored on an LRD. The instructions within the basic blocks of the stored execution sets are stored in queues
based on the previously assigned logical processor number. For example, if the system has 64 users and 8 LRDs, 8
users would share an individual LRD on a timeshared basis. The operating system determines which user is assigned
to which LRD and for how long. The LRD is detailed at length subsequently.
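
The idea of storing instructions in queues keyed by logical processor number can be pictured with the short Python sketch below. It is a hypothetical model only: the names `build_queues` and `select`, the use of one FIFO per logical processor, and the rule that a queue head is released when its firing time equals the current time are assumptions made for illustration, not the LRD design described later.

```python
from collections import deque

# (instruction, firing time, logical processor number), per Tables 5 and 6.
TAGGED = [
    ("I0", 16, 0), ("I1", 16, 1), ("I4", 16, 2),
    ("I2", 17, 0), ("I3", 18, 0),
]


def build_queues(tagged, num_lpns=4):
    """One FIFO queue per logical processor, ordered by firing time."""
    queues = [deque() for _ in range(num_lpns)]
    for name, ift, lpn in sorted(tagged, key=lambda item: item[1]):
        queues[lpn].append((ift, name))
    return queues


def select(queues, now):
    """Pop every queue head whose firing time equals the current time."""
    ready = {}
    for lpn, queue in enumerate(queues):
        if queue and queue[0][0] == now:
            ready[lpn] = queue.popleft()[1]
    return ready


if __name__ == "__main__":
    queues = build_queues(TAGGED)
    for now in (16, 17, 18):
        print(f"T{now}: {select(queues, now)}")
```

Because each instruction already carries its firing time and logical processor number, selection in this model reduces to comparing a queue head against the current time.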

The processor elements 640 are also unique to the TDA system architecture and will be discussed later. These
processor elements may display a context free stochastic property in which the future state of the system depends only
on the present state of the system and not on the path by which the present state was achieved. As such, architecturally,
the context free processor elements are uniquely different from conventional processor elements in two ways. First, the
elements have no internal permanent storage or remnants of past events such as general purpose registers or program
status words. Second, the elements do not perform any routing or synchronization functions. These tasks are
performed by the TOLL software and are implemented in the LRDs. The significance of the architecture is that the context
free processor elements are a true shared resource to the LRDs. In another preferred particular embodiment of the
invention wherein pipelined processor elements are employed, the processors are not strictly context free as was
described previously.

Finally, the register set and condition code set files 660 can also be constructed of commonly available components
such as AMD 29300 series register files, available from Advanced Micro Devices, 901 Thompson Place, P.O. Box 3453,
Sunnyvale, California 94088. However, the particular configuration of the files 660 illustrated in Figure 6 is unique under
the teachings of the present invention and will be discussed later.

The general operation of the present system, based upon the example set forth in Table 1, is illustrated with respect
to the processor-context register file communication in Figures 7a, 7b, and 7c. As mentioned, the time-driven control of
the present illustrated embodiment of the invention is found in the addition of the extended intelligence relating to the
logical processor number (LPN) and the instruction firing time (IFT) as specifically set forth in Figure 4. Figure 7
generally represents the configuration of the processor elements PE0 through PE3 with registers R0 through R5,..., R10 and
R11 of the register set and condition code set file 660.

In explaining the operation of the TDA system architecture 600 for the single user example in Table 1, reference is
made to Tables 3 through 5. In the example, for instruction firing time T16, the context file-PE network 670 interconnects
processor element PE0 with registers R0 and R10, processor element PE1 with registers R1 and R11, and processor
element PE2 with register R4. Hence, during time T16, the three processor elements PE0, PE1, and PE2 process
instructions I0, I1, and I4 concurrently and store the results in registers R0, R10, R1, R11, and R4. During time T16, the
LRD 620 selects and delivers the instructions that can fire (execute) during time T17 to the appropriate processor
elements. Referring to Figure 7b, during instruction firing time T17, only processor element PE0, which is now assigned to
process instruction I2, interconnects with registers R0, R1, and R2. The BEU (not shown in Figures 7a, 7b, and 7c) is
also connected to the condition code storage. Finally, referring to Figure 7c, during instruction firing time T18, processor
element PE0 is connected to registers R2 and R3.

Several important observations need to be made. First, when a particular processor element (PE) places results of
its operation in a register, any processor element, during a subsequent instruction firing time (IFT), can be
interconnected to that register as it executes its operation. For example, processor element PE1 for instruction I1 loads register
R1 with the contents of a memory location during IFT T16 as shown in Figure 7a. During instruction firing time T17,
processor element PE0 is interconnected with register R1 to perform an additional operation on the results stored
therein. Each processor element (PE) is "totally coupled" to the necessary registers in the register file 660 during any
particular instruction firing time (IFT) and, therefore, there is no need to move the data out of the register file for delivery
to another resource, e.g., to another processor's register, as in some conventional approaches.

In other words, each processor element can be totally coupled, during any individual instruction firing time, to any
shared register in files 660. In addition, none of the processor elements has to contend (or wait) for the availability of a
particular register or for results to be placed in a particular register as is found in some prior art systems. Also, during
any individual firing time, any processor element has full access to any configuration of registers in the register set file
660 as if such registers were its own internal registers.

Hence, the intelligence added to the instruction stream is based upon detected natural concurrencies within the
object code. The detected concurrencies are analyzed by the TOLL software, which in one illustrated embodiment
logically assigns individual logical processor elements (LPNs) to process the instructions in parallel, and unique firing
times (IFTs) so that each processor element (PE), for its given instruction, will have all necessary resources available
for processing according to its instruction requirements. In the above example, the logical processor numbers
correspond to the actual processor assignment, that is, LPN0 corresponds to PE0, LPN1 to PE1, LPN2 to PE2, and LPN3
to PE3. The invention is not so limited since any order such as LPN0 to PE1, LPN1 to PE2, etc. could be used. Or, if
the TDA system had more or fewer than four processors, a different assignment could be used as will be discussed.

The timing control for the TDA system is provided by the instruction firing times, that is, the system is time-driven.
As can be observed in Figures 7a through 7c, during each individual instruction firing time, the TDA system architecture
composed of the processor elements 640 and the PE-register set file network 670 takes on a new and unique
particular configuration fully adapted to enable the individual processor elements to concurrently process instructions while
making full use of all the available resources. The processor elements can be context free and thereby data, condition,
or information relating to past processing is not required, nor does it exist, internally to the processor element. The
context free processor elements react only to the requirements of each individual instruction and are interconnected by the
hardware to the necessary shared registers.
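
To make the time-driven, context free behavior concrete, the following Python sketch runs the Table 1 loop twice, once in serial order and once in the firing-time order of Table 6, and shows that both leave the shared registers in the same state. It is a simplified model under several assumptions: `execute`, `BLOCK` and `run` are hypothetical names, memory is a Python list, the post-increment step is one element, condition code set CC1 is not modeled, and instructions sharing a firing time are simply executed one after another, which is harmless here because they are independent.

```python
def execute(instr, regs, mem):
    """A stateless processor element: all state lives in regs and mem."""
    op, args = instr
    if op == "LD":        # LD Rd, (Ra)+ : load, then post-increment Ra
        rd, ra = args
        regs[rd] = mem[regs[ra]]
        regs[ra] += 1
    elif op == "MM":      # MM Ra, Rb, Rd : multiply
        ra, rb, rd = args
        regs[rd] = regs[ra] * regs[rb]
    elif op == "ADD":     # ADD Ra, Rb, Rd : add
        ra, rb, rd = args
        regs[rd] = regs[ra] + regs[rb]
    elif op == "DEC":     # DEC Rd : decrement and set condition code CC2
        (rd,) = args
        regs[rd] -= 1
        regs["CC2"] = regs[rd] != 0


BLOCK = {
    "I0": ("LD", ("R0", "R10")),
    "I1": ("LD", ("R1", "R11")),
    "I2": ("MM", ("R0", "R1", "R2")),
    "I3": ("ADD", ("R2", "R3", "R3")),
    "I4": ("DEC", ("R4",)),
}
SERIAL_ORDER = [["I0"], ["I1"], ["I2"], ["I3"], ["I4"]]
TIME_DRIVEN = [["I0", "I1", "I4"], ["I2"], ["I3"]]  # T16, T17, T18 (Table 6)


def run(order, regs, mem):
    """Repeat the basic block until the branch on CC2 falls through."""
    regs = dict(regs)  # work on a copy so both runs start from the same state
    while True:
        for step in order:        # one firing time per step
            for name in step:     # instructions sharing that firing time
                execute(BLOCK[name], regs, mem)
        if not regs["CC2"]:       # I5: BRNZR LOOP, taken after the block ends
            return regs


if __name__ == "__main__":
    memory = [1, 2, 3, 4, 5, 6]
    start = {"R0": 0, "R1": 0, "R2": 0, "R3": 0, "R4": 3, "R10": 0, "R11": 3}
    print(run(SERIAL_ORDER, start, memory) == run(TIME_DRIVEN, start, memory))
```

Since `execute` keeps nothing between calls, any "processor element" in this model could run any instruction at any firing time, which is the property the text attributes to context free processor elements.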

4. Summary

In summary, the TOLL software 110 for each different program or compiler output 100 analyzes the natural
concurrencies existing in each single entry, single exit (SESE) basic block (BB) and adds intelligence, including in one
illustrated embodiment, a logical processor number (LPN) and an instruction firing time (IFT), to each instruction. In an
MIMD system as shown in Figure 6, each context file would contain data from a different user executing a program.
Each user is assigned a different context file and, as shown in Figure 7, the processor elements (PEs) are capable of
individually accessing the necessary resources such as registers and condition codes storage required by the
instruction. The instruction itself carries the shared resource information (that is, the registers and condition code storage).
Hence, the TOLL software statically allocates only once for each program the necessary information for controlling the
processing of the instruction in the TDA system architecture illustrated in Figure 6 to ensure a time-driven decentralized
control wherein the memory, the logical resource drivers, the processor elements, and the context shared resources are
totally coupled through their respective networks in a pure, non-blocking fashion.

The logical resource drivers (LRDs) 620 receive the basic blocks formed in an execution set and are responsible
for delivering each instruction to the selected processor element 640 at the instruction firing time (IFT). While the
example shown in Figure 7 is a simplistic representation for a single user, it is to be expressly understood that the delivery by
the logical resource driver 620 of the instructions to the processor elements 640, in a multi-user system, makes full use
of the processor elements as will be fully discussed subsequently. Because the timing and the identity of the shared
resources and the processor elements are all contained within the extended intelligence added to the instructions by
the TOLL software, each processor element 640 can be completely (or in some instances substantially) context free
and, in fact, from instruction firing time to instruction firing time can process individual instructions of different users as
delivered by the various logical resource drivers. As will be explained, in order to do this, the logical resource drivers
620, in a predetermined order, deliver the instructions to the processor elements 640 through the PE-LRD network 650.

It is the context free nature of the processor elements which allows the independent access by any processor element
of the results of data generation/manipulation from any other processor element following the completion of each
instruction execution. In the case of processors which are not context free, in order for one processor to access data
created by another, specific actions (usually instructions which move data from general purpose registers to memory)
are required in order to extract the data from one processor and make it available to another.

It is also the context free nature of the processor elements that permits the true sharing of the processor elements
by multiple LRDs. This sharing can be as fine-grained as a single instruction cycle. No programming or special processor
operations are needed to save the state of one context (assigned to one LRD), which has control of one or more
processor elements, in order to permit control by another context (assigned to a second LRD). In processors which are
not context free, which is the case for the prior art, specific programming and special machine operations are required
in such state-saving as part of the process of context switching.

There is one additional alternative in implementing the processor elements, which is a modification to the context
free concept: an implementation which provides the physically total interconnection discussed above, but which permits,
under program control, a restriction upon the transmission of generated data to the register file following completion
of certain instructions.

In a fully context free implementation, at the completion of each instruction which enters the processor element, the
state of the context is entirely captured in the context storage file. In the alternative case, transmission to the register
file is precluded and the data is retained within the processor and made available (for example, through data chaining)
to succeeding instructions which further manipulate the data. Ultimately, data is transmitted to the register file after
some finite sequence of instructions completes; however, it is only the final data that is transmitted.

This can be viewed as a generalization of the case of a microcoded complex instruction as described above, and
can be considered a substantially context free processor element implementation. In such an implementation, the TOLL
software would be required to ensure that dependent instructions execute on the same processor element until such
time as data is ultimately transmitted to the context register file. As with pipelined processor elements, this does not
change the overall functionality and architecture of the TOLL software, but mainly affects the efficient scheduling of
instructions among processor elements to make optimal use of each instruction cycle on all processor elements.

DETAILED DESCRIPTION 

1. Detailed Description of Software 

In Figures 8 through 11, the details of the TOLL software 110 of the present invention are set forth. Referring to
Figure 8, the conventional output from a compiler is delivered to the TOLL software at the start stage 800. The following
information is contained within the conventional compiler output 800: (a) instruction functionality, (b) resources required
by the instruction, (c) locations of the resources (if possible), and (d) basic block boundaries. The TOLL software then
starts with the first instruction at stage 810 and proceeds to determine "which" resources are used in stage 820 and
"how" the resources are used in stage 830. This process continues for each instruction within the instruction stream
through stages 840 and 850 as was discussed in the previous section.

After the last instruction is processed, as tested in stage 840, a table is constructed and initialized with the "free
time" and "load time" for each resource. Such a table is set forth in Table 7 for the inner loop matrix multiply example;
at initialization, the table contains all zeros. The initialization occurs in stage 860 and, once the table is constructed, the TOLL
software proceeds to start with the first basic block in stage 870.



TABLE 7

Resource    Load Time    Free Time
R0          T0           T0
R1          T0           T0
R2          T0           T0
R3          T0           T0
R4          T0           T0
R10         T0           T0
R11         T0           T0



Referring to Figure 9, the TOLL software continues the analysis of the instruction stream with the first instruction of
the next basic block in stage 900. As stated previously, TOLL performs a static analysis of the instruction stream. Static
analysis assumes (in effect) straight line code, that is, each instruction is analyzed as it is seen in a sequential manner.
In other words, static analysis assumes that a branch is never taken. For non-pipelined instruction execution, this is not
a problem, as there will never be any dependencies that arise as a result of a branch. Pipelined execution is discussed
subsequently (although, it can be stated that the use of pipelining will only affect the delay value of the branch instruction).

Clearly, the assumption that a branch is never taken is incorrect. However, the impact of encountering a branch in
the instruction stream is straightforward. As stated previously, each instruction is characterized by the resources (or
physical hardware elements) it uses. The assignment of the firing time (and in the illustrated embodiment, the logical
processor number) is dependent on how the instruction stream accesses these resources. Within this particular
embodiment of the TOLL software, the usage of each resource is represented, as noted above, by data structures
termed the free and load times for that resource. As each instruction is analyzed in sequence, the analysis of a branch
impacts these data structures in the following manner.

When all of the instructions of a basic block have been assigned firing times, the maximum firing time of the current
basic block (the one the branch is a member of) is used to update all resources' load and free times (to this value). When
the next basic block analysis begins, the proposed firing time is then given as the last maximum value plus one. Hence,
the load and free times for each of the register resources R0 through R4, R10 and R11 are set forth below in Table 8
for the example, assuming the basic block commences with a time of T16.






TABLE 8

Resource    Load Time    Free Time
R0          T15          T15
R1          T15          T15
R2          T15          T15
R3          T15          T15
R4          T15          T15
R10         T15          T15
R11         T15          T15



Hence, the TOLL software sets a proposed firing time (PFT) in stage 910 to one plus the maximum firing time of
the previous basic block. In the context of the above example, the previous basic block's last firing time is
T15, and the proposed firing times for the instructions in this basic block commence with T16.

In stage 920, the first resource used by the first instruction, which in this case is register R0 of instruction I0, is analyzed.
In stage 930, a determination is made as to whether or not the resource is read. In the above example, for instruction
I0, register R0 is not read but is written and, therefore, stage 940 is next entered to make the determination of
whether or not the resource is written. In this case, instruction I0 writes into register R0 and stage 942 is entered. Stage
942 makes a determination as to whether the proposed firing time (PFT) for instruction I0 is less than or equal to the
free time for the resource. In this case, referring to Table 8, the resource free time for register R0 is T15 and, therefore,
the instruction proposed firing time of T16 is greater than the resource free time of T15 and the determination is "no"
and stage 950 is accessed.

The analysis by the TOLL software proceeds to the next resource which in this case, for instruction I0, is register
R10. This resource is both read and written by the instruction. Stage 930 is entered and a determination is made as to
whether or not the instruction reads the resource. It does, so stage 932 is entered where a determination is made as to
whether the current proposed firing time for the instruction (T16) is less than or equal to the resource load time (T15). It is
not, so stage 940 is entered. Here a determination is made as to whether the instruction writes the resource. It does, so stage
942 is entered. In this stage a determination is made as to whether the proposed firing time for the instruction (T16) is
less than or equal to the free time for the resource (T15). It is not, and stage 950 is accessed. The analysis by the TOLL
software proceeds either to the next resource (there is none for instruction I0) or to "B" (Figure 10) if the last resource for the
instruction has been processed.
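
The Figure 9 walk-through above can be condensed into a short Python sketch. It is illustrative only: the names `ResourceUse` and `propose_firing_time` are hypothetical, and firing times are modelled as plain integers (T16 becomes 16). The adjustment made when a write conflicts with a resource's free time is not described in this passage, so the sketch assumes it mirrors stage 934 (free time plus one).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ResourceUse:
    """How one instruction uses one resource (hypothetical helper type)."""
    name: str
    read: bool
    write: bool


def propose_firing_time(
    uses: List[ResourceUse],
    table: Dict[str, Tuple[int, int]],  # resource -> (load time, free time)
    initial_pft: int,
) -> int:
    """Return the proposed firing time after visiting every resource (stages 920-960)."""
    pft = initial_pft
    for use in uses:
        load_time, free_time = table[use.name]
        # Stages 930/932/934: a read must fire after the resource was last loaded.
        if use.read and pft <= load_time:
            pft = load_time + 1
        # Stages 940/942: a write must fire after the resource is free.
        # ASSUMPTION: the adjustment itself is not given in the text.
        if use.write and pft <= free_time:
            pft = free_time + 1
    return pft


# Table 8 state: every resource has load time and free time T15.
table8 = {r: (15, 15) for r in ("R0", "R1", "R2", "R3", "R4", "R10", "R11")}
i0 = [ResourceUse("R0", read=False, write=True),
      ResourceUse("R10", read=True, write=True)]
print(propose_firing_time(i0, table8, initial_pft=16))
```

With the Table 8 values no test fires for I0, so its proposed firing time stays at T16, in agreement with the text.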

Hence, the answer to the determination at stage 950 is affirmative and the analysis then proceeds to Figure 10. In
Figure 10, the resource free and load times will be set. At stage 1000, the first resource for instruction I0 is register R0.
The first determination in stage 1010 is whether or not the instruction reads the resource. As before, register R0 in
instruction I0 is not read but written and the answer to this determination is "no" in which case the analysis then proceeds
to stage 1020. In stage 1020, the answer to the determination as to whether or not the resource is written is "yes"
and the analysis proceeds to stage 1022. Stage 1022 makes the determination as to whether or not the proposed firing
time for the instruction is greater than the resource load time. In the example, the proposed firing time is T16 and, with
reference back to Table 8, the firing time T16 is greater than the load time T15 for register R0. Hence, the response to
this determination is "yes" and stage 1024 is entered. In stage 1024, the resource load time is set equal to the instruction's
proposed firing time and the table of resources (Table 8) is updated to reflect that change. Likewise, stage 1026
is entered and the resource free-time is updated and set equal to the instruction's proposed firing time plus one, or T17
(T16 plus one).

Stage 1030 is then entered and a determination made as to whether there are any further resources used by this
instruction. There is one, register R10, and the analysis processes this resource. The next resource is acquired at stage
1070. Stage 1010 is then entered where a determination is made as to whether or not the resource is read by the
instruction. It is and so stage 1012 is entered where a determination is made as to whether the current proposed firing
time (T16) is greater than the resource's free-time (T15). It is, and therefore stage 1014 is entered where the resource's
free-time is updated to reflect the use of this resource by this instruction. The method next checks at stage 1020
whether the resource is written by the instruction. It is, and so stage 1022 is entered where a determination is made as
to whether or not the current proposed firing time (T16) is greater than the load time of the resource (T15). It is, so stage
1024 is entered. In this stage, the resource's load-time is updated to reflect the firing time of the instruction, that is, the
load-time is set to T16. Stage 1026 is then entered where the resource's free-time is updated to reflect the execution of
the instruction, that is, the free-time is set to T17. Stage 1030 is then entered where a determination is made as to
whether or not this is the last resource used by the instruction. It is, and therefore, stage 1040 is entered. The instruction
firing time (IFT) is now set to equal the proposed firing time (PFT) of T16. Stage 1050 is then accessed which makes a
determination as to whether or not this is the last instruction in the basic block, which in this case is "no"; and stage 1060
is entered to proceed to the next instruction, I1, which enters the analysis stage at "A1" of Figure 9.

The next instruction in the example is I1 and the identical analysis is had for instruction I1 for registers R1 and R11
as presented for instruction I0 with registers R0 and R10. In Table 9 below, a portion of the resource Table 8 is modified
to reflect these changes. (Instructions I0 and I1 have been fully processed by the TOLL software.)

TABLE 9

Resource    Load Time    Free Time
R0          T16          T17
R1          T16          T17
R10         T16          T17
R11         T16          T17
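
The Figure 10 update that turns Table 8 into Table 9 can be sketched in Python as follows. This is a minimal illustration, not the TOLL implementation: the tuple encoding of resource usage and the function name are hypothetical, times are plain integers, and two details the text leaves open are filled by labeled assumptions (the value written at stage 1014, and what happens on the "no" branch of stage 1022).

```python
from typing import Dict, List, Tuple

# (resource name, is_read, is_written) for one instruction; a hypothetical encoding.
Use = Tuple[str, bool, bool]


def update_resource_table(
    uses: List[Use],
    table: Dict[str, List[int]],  # resource -> [load time, free time]
    pft: int,
) -> int:
    """Apply stages 1000-1040 for one instruction and return its firing time (IFT)."""
    for name, is_read, is_written in uses:
        entry = table[name]
        # Stages 1010/1012/1014: a read later than the free time pushes the
        # free time out. ASSUMPTION: it is set to the PFT itself.
        if is_read and pft > entry[1]:
            entry[1] = pft
        # Stages 1020/1022/1024/1026: a write records the new load time and
        # frees the resource one cycle later. The "no" branch of stage 1022
        # is not described in the text, so the entry is left unchanged there.
        if is_written and pft > entry[0]:
            entry[0] = pft
            entry[1] = pft + 1
    return pft  # stage 1040: IFT = PFT


table = {r: [15, 15] for r in ("R0", "R1", "R2", "R10", "R11")}  # Table 8 state
ift_i0 = update_resource_table([("R0", False, True), ("R10", True, True)], table, 16)
ift_i1 = update_resource_table([("R1", False, True), ("R11", True, True)], table, 16)
print(ift_i0, ift_i1, table)
```

After both calls, R0, R1, R10 and R11 hold load time 16 and free time 17, matching Table 9, while R2 is still at its Table 8 values.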



The next instruction in the basic block example is instruction I2 which involves a read of registers R0 and R1 and a
write into register R2. Hence, in stage 910 of Figure 9, the proposed firing time for the instruction is set to T16 (T15 plus
1). Stage 920 is then entered and the first resource in instruction I2 is register R0. The first determination made in stage
930 is "yes" and stage 932 is entered. At stage 932, a determination is made whether the instruction's proposed firing
time of T16 is less than or equal to the resource register R0 load time of T16. It is important to note that the resource
load time for register R0 was updated during the analysis of register R0 for instruction I0 from time T15 to time T16. The
answer to this determination in stage 932 is that the proposed firing time equals the resource load time (T16 equals
T16) and stage 934 is entered. In stage 934, the instruction proposed firing time is updated to equal the resource load
time plus one or, in this case, T17 (T16 plus one). The instruction I2 proposed firing time is now updated to T17. Now
stage 940 is entered and since instruction I2 does not write resource R0, the answer to the determination is "no" and
stage 950 and then stage 960 are entered to process the next resource which in this case is register R1.

Stage 960 initiates the analysis to take place for register R1 and a determination is made in stage 930 whether or
not the resource is read. The answer, of course, is "yes" and stage 932 is entered. This time the instruction proposed
firing time is T17 and a determination is made whether or not the instruction proposed firing time of T17 is less than or
equal to the resource load time for register R1 which is T16. Since the instruction proposed firing time is greater than
the register load time (T17 is greater than T16), the answer to this determination is "no" and stage 940 is entered. The register is
not written by this instruction and, therefore, the analysis proceeds to stage 950. The next resource to be processed for
instruction I2, in stage 960, is resource register R2.

The first determination of stage 930 is whether or not this resource R2 is read. It is not and hence the analysis
moves to stage 940 and then to stage 942. At this point in time the instruction I2 proposed firing time is T17 and in stage
942 a determination is made whether or not the instruction's proposed firing time of T17 is less than or equal to
resource R2's free time, which in Table 8 above is T15. The answer to this determination is "no" and therefore stage 950
is entered. This is the last resource processed for this instruction and the analysis continues in Figure 10.

Referring to Figure 10, the first resource R0 for instruction I2 is analyzed. In stage 1010, the determination is made
whether or not this resource is read and the answer is "yes." Stage 1012 is then entered to make the determination
whether the proposed firing time T17 of instruction I2 is greater than the resource free-time for register R0. In Table 9,
the free-time for register R0 is T17 and the answer to the determination is "no" since both are equal. Stage 1020 is then
entered which also results in a "no" answer transferring the analysis to stage 1030. Since this is not the last resource
to be processed for instruction I2, stage 1070 is entered to advance the analysis to the next resource register R1. Precisely
the same path through Figure 10 occurs for register R1 as for register R0. Next, stage 1070 initiates processing
of register R2. In this case, the answer to the determination at stage 1010 is "no" and stage 1020 is accessed. Since
register R2 for instruction I2 is written, stage 1022 is accessed. In this case, the proposed firing time of instruction I2 is
T17 and the resource load-time is T15 from Table 8. Hence, the proposed firing time is greater than the load time and
stage 1024 is accessed. Stages 1024 and 1026 cause the load time and the free time for register R2 to be advanced,
respectively, to T17 and T18, and the resource table is updated as shown in Table 10:

TABLE 10

Resource    Load-Time    Free-Time
R0          T16          T17
R1          T16          T17
R2          T17          T18

As this is the last resource processed for instruction I2, the proposed firing time of T17 becomes the actual firing time
(stage 1040) and the next instruction is analyzed.

It is in this fashion that each of the instructions in the inner loop matrix multiply example is analyzed so that, when
fully analyzed, the final resource table appears as in Table 11 below:



TABLE 11

Resource    Load-Time    Free-Time
R0          T16          T17
R1          T16          T17
R2          T17          T18
R3          T18          T19
R4          T16          T17
R10         T16          T17
R11         T16          T17

Referring to Figure 11, the TOLL software, after performing the tasks set forth in Figures 9 and 10, enters stage
1100. Stage 1100 sets all resource free and load times to the maximum of those within the given basic block. For example,
the maximum time set forth in Table 11 is T19 and, therefore, all free and load times are set to time T19. Stage 1110
is then entered to make the determination whether this is the last basic block to be processed. If not, stage 1120 is
entered to proceed with the next basic block. If this is the last basic block, stage 1130 is entered and starts again with
the first basic block in the instruction stream. The purpose of this analysis is to logically reorder the instructions within
each basic block and to assign logical processor numbers to each instruction. This is summarized in Table 6 for the
inner loop matrix multiply example. Stage 1140 performs the function of sorting the instructions in each basic block in
ascending order using the instruction firing time (IFT) as the basis. Stage 1150 is then entered wherein the logical processor
numbers (LPNs) are assigned. In making the assignment of the processor elements, the instructions of a set, that
is, those having the same instruction firing time (IFT), are assigned logical processor numbers on a first come, first served
basis. For example, in reference back to Table 6, the first set of instructions for firing time T16 are I0, I1, and I4. These
instructions are assigned respectively to processors PE0, PE1, and PE2. Next, during time T17, the second set of
instructions I2 and I5 are assigned to processors PE0 and PE1, respectively. Finally, during the final time T18, the final
instruction I3 is assigned to processor PE0. It is to be expressly understood that the assignment of the processor elements
could be effected using other methods and is based upon the actual architecture of the processor element and
the system. As is clear, in the preferred embodiment the set of instructions are assigned to the logical processors on a
first in time basis. After making the assignment, stage 1160 is entered to determine whether or not the last basic block
has been processed and if not, stage 1170 brings forth the next basic block and the process is repeated until finished.
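
Stages 1140 and 1150 amount to a stable sort followed by per-time numbering. The sketch below is an illustration under stated assumptions rather than the TOLL code: `assign_lpns` is a hypothetical name, firing times are integers, and "first come, first served" is taken to mean the original instruction order within each firing time. The firing times are those quoted from Table 6 in the paragraph above.

```python
from itertools import groupby
from typing import Dict, List, Tuple


def assign_lpns(firing_times: Dict[str, int]) -> List[Tuple[int, str, int]]:
    """Sort instructions by IFT, then number them 0, 1, 2, ... within each IFT."""
    # sorted() is stable, so instructions sharing an IFT keep their original order.
    ordered = sorted(firing_times.items(), key=lambda item: item[1])
    schedule = []
    for ift, group in groupby(ordered, key=lambda item: item[1]):
        for lpn, (name, _) in enumerate(group):
            schedule.append((ift, name, lpn))
    return schedule


# Instruction firing times for the example, as integers (T16 -> 16).
ifts = {"I0": 16, "I1": 16, "I2": 17, "I3": 18, "I4": 16, "I5": 17}
for ift, name, lpn in assign_lpns(ifts):
    print(f"T{ift}: {name} -> PE{lpn}")
```

The resulting assignment places I0, I1 and I4 on PE0 through PE2 at T16, I2 and I5 on PE0 and PE1 at T17, and I3 on PE0 at T18.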

Hence, the output of the TOLL software, in this illustrated embodiment, results in the assignment of the instruction
firing time (IFT) for each of the instructions as shown in Figure 4. As previously discussed, the instructions are reordered,
based upon the natural concurrencies appearing in the instruction stream, according to the instruction firing
times; and, then, individual logical processors are assigned as shown in Table 6. While the discussion above has concentrated
on the inner loop matrix multiply example, the analysis set forth in Figures 9 through 11 can be applied to any
SESE basic block (BB) to detect the natural concurrencies contained therein and then to assign the instruction firing
times (IFTs) and the logical processor numbers (LPNs) for each user's program. This intelligence can then be added to
the reordered instructions within the basic block. This is only done once for a given program and provides the necessary
time-driven decentralized control and processor mapping information to run on the TDA system architecture of the
present invention.

The purpose of the execution sets, referring to Figure 12, is to optimize program execution by maximizing instruction
cache hits within an execution set or, in other words, to statically minimize transfers by a basic block within an execution
set to a basic block in another execution set. Support of execution sets consists of three major components: data
structure definitions, pre-execution time software which prepares the execution set data structures, and hardware to
support the fetching and manipulation of execution sets in the process of executing the program.

The execution set data structure consists of a set of one or more basic blocks and an attached header. The header
contains the following information: the address 1200 of the start of the actual instructions (this is implicit if the header
has a fixed length), the length 1210 of the execution set (or the address of the end of the execution set), and zero or
more addresses 1220 of potential successor (in terms of program execution) execution sets.
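
As a rough illustration of the header just described, the following Python sketch models its three fields. The class and field names are hypothetical, and the sketch stores the length rather than the end address (the text allows either).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExecutionSetHeader:
    """Hypothetical model of the header attached to an execution set."""
    start_address: int                     # address 1200: start of the instructions
    length: int                            # length 1210: size of the execution set
    successors: List[int] = field(default_factory=list)  # addresses 1220

    @property
    def end_address(self) -> int:
        """First address past the execution set (the alternative to storing a length)."""
        return self.start_address + self.length

    def contains(self, address: int) -> bool:
        """True if an instruction address falls inside this execution set."""
        return self.start_address <= address < self.end_address


header = ExecutionSetHeader(start_address=0x100, length=0x40, successors=[0x200])
print(hex(header.end_address), header.contains(0x13F), header.contains(0x140))
```

A branch target for which `contains` is false corresponds to a transfer out of the current execution set, which is the case the grouping tries to make rare.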

The software required to support execution sets manipulates the output of the post-compile processing. That
processing performs dependency analysis, resource analysis, resource assignment, and individual instruction stream
reordering. The formation of execution sets uses one or more algorithms for determining the probable order and frequency
of execution of the basic blocks. The basic blocks are grouped accordingly. The possible algorithms are similar
to the algorithms used in solving linear programming problems for least-cost routing. In the case of execution sets, cost
is associated with branching. Branching between basic blocks contained in the same execution set incurs no penalty
with respect to cache operations because it is assumed that the instructions for the basic blocks of an execution set are
resident in the cache in the steady state. Cost is then associated with branching between basic blocks of different execution
sets, because the instructions of the basic blocks of a different execution set are assumed not to be in cache.
Cache misses delay program execution while the retrieval and storage of the appropriate block from main memory to
cache is made.

There are several possible algorithms which can be used to assess and assign costs. One algorithm is the static
branch cost approach. In accordance with this method, one begins by placing basic blocks into execution sets based
on block contiguity and a maximum allowable execution set size (this would be an implementation limit, such as maximum
instruction cache size). The information about branching between basic blocks is known and is an output of the
compiler. Using this information, the apparatus calculates the "cost" of the resulting grouping of basic blocks into execution
sets based on the number of (static) branches between basic blocks in different execution sets. The apparatus
can then use standard linear programming techniques to minimize this cost function, thereby obtaining the "optimal"
grouping of basic blocks into execution sets. This algorithm has the advantage of ease of implementation; however, it
ignores the actual dynamic branching patterns which occur during actual program execution.
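
The first two steps of the static branch cost approach can be sketched in Python as below. The sketch covers only the initial contiguity-based grouping and the evaluation of the cost; the minimization step is not shown. Function names, block names, block sizes and the branch list are all illustrative assumptions.

```python
from typing import Dict, List, Tuple


def group_contiguously(blocks: List[Tuple[str, int]], max_size: int) -> List[List[str]]:
    """Pack basic blocks, in program order, into execution sets of at most max_size."""
    sets: List[List[str]] = []
    current: List[str] = []
    used = 0
    for name, size in blocks:
        # Start a new execution set when the next block would not fit.
        if current and used + size > max_size:
            sets.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        sets.append(current)
    return sets


def static_branch_cost(sets: List[List[str]], branches: List[Tuple[str, str]]) -> int:
    """Count static branches whose source and target lie in different execution sets."""
    owner: Dict[str, int] = {name: i for i, group in enumerate(sets) for name in group}
    return sum(1 for src, dst in branches if owner[src] != owner[dst])


# Hypothetical program: four basic blocks of four instruction words each.
blocks = [("BB0", 4), ("BB1", 4), ("BB2", 4), ("BB3", 4)]
branches = [("BB0", "BB1"), ("BB1", "BB2"), ("BB3", "BB2"), ("BB1", "BB0")]
sets = group_contiguously(blocks, max_size=8)
print(sets, static_branch_cost(sets, branches))
```

Here only the branch from BB1 to BB2 crosses an execution set boundary, so the grouping has a static cost of one; a minimizer would compare this figure across candidate groupings.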

Other algorithms could be used which provide a better estimation of actual dynamic branch patterns. One example
would be the collection of actual branch data from a program execution, and the resultant regrouping of the basic blocks
using a weighted assignment of branch costs based on the actual inter-block branching. Clearly, this approach is data
dependent. Another approach would be to allow the programmer to specify branch probabilities, after which the
weighted cost assignment would be made. This approach has the disadvantages of programmer intervention and programmer
error. Still other approaches would be based on using parameters, such as limiting the number of basic blocks
per execution set, and applying heuristics to these parameters.

The algorithms described above are not unique to the problem of creating execution sets. However, the use of execution
sets as a means of optimizing instruction cache performance is novel. Like the novelty of pre-execution time
assignment of processor resources, the pre-execution time grouping of basic blocks for maximizing cache performance
is not found in prior art.

The final element required to support the execution sets is the hardware. As will be discussed subsequently, this
hardware includes storage to contain the current execution set starting and ending addresses and to contain the other
execution set header data. The existence of execution sets and the associated header data structures are, in fact, transparent
to the actual instruction fetching from cache to the processor elements. The latter depends strictly upon the individual
instruction and branch addresses. The execution set hardware operates independently of instruction fetching to
control the movement of instruction words from main memory to the instruction cache. This hardware is responsible for
fetching basic blocks of instructions into the cache until either the entire execution set resides in cache or program execution
has reached a point that a branch has occurred to a basic block outside the execution set. At this point, since the
target execution set is not resident in cache, the execution set hardware begins fetching the basic blocks belonging to
the target execution set.

Referring to Figure 13, the structure of the register set file 660 for context file zero (the structure being the same for
each context file) has L+1 levels of register sets with each register set containing n+1 separate registers. For example,
n could equal 31 for a total of 32 registers. Likewise, L could equal 15 for a total of 16 levels. Note that these registers
are not shared between levels; that is, each level has a set of registers which is physically distinct from the registers
of each other level.

Each level of registers corresponds to that group of registers available to a subroutine executing at a particular
depth relative to the main program. For example, the set of registers at level zero can be available to the main program;
the set of registers at level one can be available to a first level subroutine that is called directly from the main program;
the set of registers at level two can be available to any subroutine (a second level subroutine) called directly by a first
level subroutine; the set of registers at level three can be available to any subroutine called directly by a second level
subroutine; and so on.

As these sets of registers are independent, the maximum number of levels corresponds to the number of subroutines
that can be nested before having to physically share any registers between subroutines, that is, before having to
store the contents of any registers in main memory. The register sets, in their different levels, constitute a shared
resource and significantly save system overhead during subroutine calls since only rarely do sets of registers need to
be stored, for example in a stack, in memory.

Communication between different levels of subroutines takes place, in the preferred illustrated embodiment, by
allowing each subroutine up to three possible levels from which to obtain a register: the current level, the previous (calling)
level (if any) and the global (main program) level. The designation of which level of registers is to be accessed, that
is, the level relative to the presently executing main program or subroutine, uses the static SCSM information attached
to the instruction by the TOLL software. This information designates a level relative to the instruction to be processed.
This can be illustrated by a subroutine call for a SINE function that takes as its argument a value representing an angular
measure and returns the trigonometric SINE of that measure. The main program is set forth in Table 12; and the subroutine
is set forth in Table 13.



TABLE 12

Main Program       Purpose
LOAD X, R1         Load X from memory into Reg R1 for parameter passing
CALL SINE          Subroutine Call - Returns result in Reg R2
LOAD R2, R3        Temporarily save results in Reg R3
LOAD Y, R1         Load Y from memory into Reg R1 for parameter passing
CALL SINE          Subroutine Call - Returns result in Reg R2
MULT R2, R3, R4    Multiply Sin (x) with Sin (y) and store result in Reg R4
STORE R4, Z        Store final result in memory at Z



The SINE subroutine is set forth in Table 13: 

TABLE 13

Instruction    Subroutine            Purpose
I0             Load R1(L0), R2       Load Reg R2, level 1 with contents of Reg R1, level 0
Ip-1           (Perform SINE), R7    Calculate SINE function and store result in Reg R7, level 1
Ip             Load R7, R2(L0)       Load Reg R2, level 0 with contents of Reg R7, level 1



Hence, with reference to Figure 14, instruction I0 of the subroutine loads register R2 of the current level (the subroutine's
level or called level) with the contents of register R1 from the previous level (the calling routine or level). Note
that the subroutine has a full set of registers with which to perform the processing independent of the register set of the
calling routine. Upon completion of the subroutine call, instruction Ip causes register R7 of the current level to be stored
in register R2 of the calling routine's level (which returns the results of the SINE routine back to the calling program's
register set).
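
The register-level mechanism of Figures 13 and 14 can be modelled with a small Python sketch. It is an illustration only: the class names, the `Level` enumeration used to encode the relative level, and the choice of 0.5 as the value of X are assumptions, and no claim is made that the hardware resolves levels this way.

```python
import math
from enum import Enum


class Level(Enum):
    """Relative level selector carried with a register operand (hypothetical encoding)."""
    CURRENT = "current"
    PREVIOUS = "previous"
    GLOBAL = "global"


class RegisterSetFile:
    def __init__(self, levels: int = 16, registers: int = 32) -> None:
        # Each level owns a physically distinct set of registers.
        self.sets = [[0.0] * registers for _ in range(levels)]
        self.current = 0  # procedural level of the running routine

    def absolute(self, rel: Level) -> int:
        """Map a relative (static) level to an absolute (dynamic) one."""
        if rel is Level.CURRENT:
            return self.current
        if rel is Level.PREVIOUS:
            if self.current == 0:
                raise ValueError("the main program has no calling level")
            return self.current - 1
        return 0  # GLOBAL: the main program's level

    def read(self, reg: int, rel: Level = Level.CURRENT) -> float:
        return self.sets[self.absolute(rel)][reg]

    def write(self, reg: int, value: float, rel: Level = Level.CURRENT) -> None:
        self.sets[self.absolute(rel)][reg] = value

    def call(self) -> None:
        if self.current + 1 >= len(self.sets):
            raise OverflowError("nesting exceeds the available register levels")
        self.current += 1

    def ret(self) -> None:
        self.current -= 1


def sine_subroutine(rf: RegisterSetFile) -> None:
    rf.call()
    rf.write(2, rf.read(1, Level.PREVIOUS))    # I0:   Load R1(L0), R2
    rf.write(7, math.sin(rf.read(2)))          # Ip-1: (Perform SINE), R7
    rf.write(2, rf.read(7), Level.PREVIOUS)    # Ip:   Load R7, R2(L0)
    rf.ret()


rf = RegisterSetFile()
rf.write(1, 0.5)       # LOAD X, R1 (X assumed to be 0.5)
sine_subroutine(rf)    # CALL SINE
print(rf.read(2))      # result returned in the caller's R2
```

Because each level has its own registers, the subroutine's use of R7 leaves the caller's R7 untouched, and nothing is saved to memory across the call.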

As described in more detail in connection with Figure 22, the transfer between the levels occurs through the use of 
the SCSM dynamically generated information which can contain the absolute value of the current procedural level of 
the instruction (that is, the level of the called routine), the previous procedural level (that is, the level of the calling rou- 
tine) and the context identifier. The absolute dynamic SCSM level information is generated by the LRD from the relative 
55 (static) SCSM information provided by the TOLL software. The context identifier is only used when processing a number 
of programs in a multi-user system. The relative SCSM information is shown in Table 13 for register R1 (of the calling 
routine) as R1(L0) and for register R2 as R2(L0). All registers of the current level have appended an implied (00) signi- 
fying the current procedural level. 
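
The level-relative register addressing described above can be illustrated with a short Python sketch. It is only a model under stated assumptions: the class name `ContextFile`, the parameter `levels_up`, and the register count are hypothetical, and the relative level code is treated simply as "how many procedural levels back toward the caller", which is then converted into an absolute level in the way the LRD is described as doing.

```python
import math

NUM_REGS = 8  # assumed register count per level, chosen only for this sketch


class ContextFile:
    """Hypothetical model: one independent register set per procedural level."""

    def __init__(self):
        self.levels = {}  # absolute level -> list of register values

    def _regs(self, level):
        return self.levels.setdefault(level, [0.0] * NUM_REGS)

    @staticmethod
    def absolute_level(current_level, levels_up):
        # Static (relative) level information combined with the current
        # level yields the absolute level (assumed encoding).
        if not 0 <= levels_up <= current_level:
            raise ValueError("relative level reaches above the outermost routine")
        return current_level - levels_up

    def read(self, current_level, reg, levels_up=0):
        return self._regs(self.absolute_level(current_level, levels_up))[reg]

    def write(self, current_level, reg, value, levels_up=0):
        self._regs(self.absolute_level(current_level, levels_up))[reg] = value


def call_sine(ctx, caller_level):
    """Mimic the three steps of Table 13 for a subroutine one level down."""
    level = caller_level + 1
    # I0: load R2 of the called level from R1 of the calling level.
    ctx.write(level, 2, ctx.read(level, 1, levels_up=1))
    # Ip-1: compute the sine and keep it in R7 of the called level.
    ctx.write(level, 7, math.sin(ctx.read(level, 2)))
    # Ip: return the result into R2 of the calling level.
    ctx.write(level, 2, ctx.read(level, 7), levels_up=1)


ctx = ContextFile()
ctx.write(0, 1, 0.5)          # caller places the parameter in its own R1
call_sine(ctx, caller_level=0)
result = ctx.read(0, 2)       # caller finds the result in its own R2
```

In this model the caller's own R7 is never touched by the subroutine, which is the point of giving each level a complete register set.
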
The method and structure described in connection with Figures 13 and 14 differ substantially from prior art
approaches in which a subroutine and its calling routine physically share the same registers. Because such sharing
limits the number of registers that are available for use by the subroutine, more system overhead is required for
storing the registers in main memory. See, for example, the MIPS approach as set forth in "Reduced Instruction Set
Computers," David A. Patterson, Communications of the ACM, January 1985, Vol. 28, No. 1, pp. 8-21. In that
reference, the first sixteen registers are local registers to be used solely by the subroutine, the next eight registers,
registers 16 through 23, are shared between the calling routine and the subroutine, and the final eight registers,
registers 24 through 31, are shared between the global (or main) program and the subroutine. Clearly, out of the 32
registers that are accessible by the subroutine, only 16 are dedicated solely for use by the subroutine in the
processing of its program. In the processing of complex subroutines, the limited number of registers that are dedicated
solely to the subroutine may not (in general) be sufficient for the processing of the subroutine. Data shuffling
(entailing the storing of intermediate data in memory) must then occur, resulting in significant overhead in the
processing of the routine.

The relative transfers between the levels which are known to occur at compile time are specified by adding the
requisite information to the register identifiers as shown in Figure 4 (the SCSM data), to appropriately map the
instructions between the various levels. Hence, a completely independent set of registers is available to the calling
routine and to each level of subroutine. The calling routine, in addition to accessing its own complete set of registers,
can also gain direct access to a higher set of registers using the aforesaid static SCSM mapping code which is added to
the instruction, as previously described. Unlike the prior art approaches, there is no reduction in the size of the
register set available to a subroutine. Furthermore, the mapping code for the SCSM information can be a field of
sufficient length to access any number of desired levels. For example, in one illustrated embodiment, a calling routine
can access up to seven higher levels in addition to its own registers with a field of three bits. The present invention is
not to be limited to any particular number of levels nor to any particular number of registers within a level. The
mapping shown in Figure 14 is a logical mapping and not a conventional physical mapping. For example, three levels, such
as the calling routine level, the called level, and the global level require three bit maps. The relative identification
of the levels can be specified by a two bit word in the static SCSM, for example, the calling routine by (00), the
subordinate level by (01), and the global level by (11). Thus, each user's program is analyzed and the static SCSM
relative procedural level information, also designated a window code, is added to the instructions prior to the issuance
of the user program to a specific LRD. Once the user is assigned to a specific LRD, the static SCSM level information is
used to generate the LRD dependent and dynamic SCSM information which is added as it is needed.


2. Detailed Description of the Hardware 

As shown in Figure 6, the TDA system 600 of the present invention is composed of memory 610, logical resource
drivers (LRD) 620, processor elements (PEs) 640, and shared context storage files 660. The following detailed
description starts with the logical resource drivers since the TOLL software output is loaded into this hardware.

a. Logical Resource Drivers (LRDs) 

The details of a particular logical resource driver (LRD) are set forth in Figure 15. As shown in Figure 6, each
logical resource driver 620 is interconnected to the LRD-memory network 630 on one side and to the processor elements
640 through the PE-LRD network 650 on the other side. If the present system were a SIMD machine, then only one LRD and
only one context file would be provided. For MIMD capabilities, one LRD and one context file are provided for each
user so that, in the embodiment illustrated in Figure 6, up to "n" users can be accommodated.

The logical resource driver 620 is composed of a data cache section 1500 and an instruction selection section
1510. In the instruction selection section, the following components are interconnected. An instruction cache address
translation unit (ATU) 1512 is interconnected to the LRD-memory network 630 over a bus 1514. The instruction cache
ATU 1512 is further interconnected over a bus 1516 to an instruction cache control circuit 1518. The instruction cache
control circuit 1518 is interconnected over lines 1520 to a series of cache partitions 1522a, 1522b, 1522c, and 1522d.
Each of the cache partitions is respectively connected over busses 1524a, 1524b, 1524c, and 1524d to the LRD-memory
network 630. Each cache partition circuit is further interconnected over lines 1536a, 1536b, 1536c, and 1536d to a
processor instruction queue (PIQ) bus interface unit 1544. The PIQ bus interface unit 1544 is connected over lines 1546
to a branch execution unit (BEU) 1548 which in turn is connected over lines 1550 to the PE-context file network 670.
The PIQ bus interface unit 1544 is further connected over lines 1552a, 1552b, 1552c, and 1552d to a processor
instruction queue (PIQ) buffer unit 1560 which in turn is connected over lines 1562a, 1562b, 1562c, and 1562d to a
processor instruction queue (PIQ) processor assignment circuit 1570. The PIQ processor assignment circuit 1570 is in
turn connected over lines 1572a, 1572b, 1572c, and 1572d to the PE-LRD network 650 and hence to the processor elements
640.

On the data cache portion 1500, a data cache ATU 1580 is interconnected over bus 1582 to the LRD-memory network
630 and is further interconnected over bus 1584 to a data cache control circuit 1586 and over lines 1588 to a data
cache interconnection network 1590. The data cache control 1586 is also interconnected to data cache partition circuits
1592a, 1592b, 1592c and 1592d over lines 1593. The data cache partition circuits, in turn, are interconnected over lines
1594a, 1594b, 1594c, and 1594d to the LRD-memory network 630. Furthermore, the data cache partition circuits 1592
are interconnected over lines 1596a, 1596b, 1596c, and 1596d to the data cache interconnection network 1590. Finally,
the data cache interconnection network 1590 is interconnected over lines 1598a, 1598b, 1598c, and 1598d to the
PE-LRD network 650 and hence to the processor elements 640.

In operation, each logical resource driver (LRD) 620 has two sections, the data cache portion 1500 and the
instruction selection portion 1510. The data cache portion 1500 acts as a high speed data buffer between the processor
elements 640 and memory 610. Note that due to the number of memory requests that must be satisfied per unit time, the
data cache 1500 is interleaved. All data requests made to memory by a processor element 640 are issued on the data
cache interconnection network 1590 and intercepted by the data cache 1592. The requests are routed to the appropriate
data cache 1592 by the data cache interconnection network 1590 using the context identifier that is part of the
dynamic SCSM information attached by the LRD to each instruction that is executed by the processors. The address of
the desired datum determines in which cache partition the datum resides. If the requested datum is present (that is, a
data cache hit occurs), the datum is sent back to the requesting processor element 640.
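
As a rough illustration of how an address can select one of the interleaved data cache partitions, the Python sketch below interleaves consecutive cache lines across the partitions. The specification says only that the address determines the partition; the modulo scheme, the line size, and the names `partition_for` and `lookup` are assumptions made for this example.

```python
NUM_PARTITIONS = 4   # partitions 1592a-1592d in the illustrated embodiment
LINE_BYTES = 16      # assumed cache-line size, not taken from the specification


def partition_for(address):
    """Pick a partition by interleaving consecutive cache lines (assumed scheme)."""
    return (address // LINE_BYTES) % NUM_PARTITIONS


def lookup(partitions, address):
    """Return (hit, datum); on a miss the request would go to the ATU and memory."""
    cache = partitions[partition_for(address)]
    if address in cache:
        return True, cache[address]
    return False, None


partitions = [dict() for _ in range(NUM_PARTITIONS)]
partitions[partition_for(0x40)][0x40] = 123   # pre-load one datum for the demo
```

With these assumed parameters, addresses one line apart fall in different partitions, so several requests can be served in the same cycle.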

If the requested datum is not present in the data cache, the address delivered to the cache 1592 is sent to the data
cache ATU 1580 to be translated into a system address. The system address is then issued to memory. In response, a
block of data from memory (a cache line or block) is delivered into the cache partition circuits 1592 under control of
data cache control 1586. The requested data, which is resident in this cache block, is then sent through the data cache
interconnection network 1590 to the requesting processor element 640. It is to be expressly understood that this is only
one possible design. The data cache portion is of conventional design and many possible implementations are realizable
to one skilled in the art. As the data cache is of standard functionality and design, it will not be discussed further.

The instruction selection portion 1510 of the LRD has three major functions: instruction caching, instruction
queueing and branch execution. The system function of the instruction cache portion of selection portion 1510 is typical
of any instruction caching mechanism. It acts as a high speed instruction buffer between the processors and memory.
However, this specification presents methods and an apparatus configuration for realizing this function that are unique.

One purpose of the instruction portion 1510 is to receive execution sets from memory, place the sets into the
caches 1522 and furnish the instructions within the sets, on an as needed basis, to the processor elements 640. As the
system contains multiple, generally independent, processor elements 640, requests to the instruction cache are for a
group of concurrently executable instructions. Again, due to the number of requests that must be satisfied per unit time,
the instruction cache is interleaved. The group size ranges from none to the number of processors available to the
users. The groups are termed packets, although this does not necessarily imply that the instructions are stored in a
contiguous manner. Instructions are fetched from the cache on the basis of their instruction firing time (IFT). The next
instruction firing time register contains the firing time of the next packet of instructions to be fetched. This register
may be loaded by the branch execution unit 1548 of the LRD as well as incremented by the cache control unit 1518 when
an instruction fetch has been completed.

The next IFT register (NIFTR) is a storage register that is accessible from the context control unit 1518 and the
branch execution unit 1548. Due to its simple functionality, it is not explicitly shown. Technically, it is a part of the
instruction cache control unit 1518, and is further buried in the control unit 1660 (Figure 16). The key point here is
that the NIFTR is merely a storage register which can be incremented or loaded.
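
The two operations on the NIFTR can be modelled in a few lines of Python. This is a minimal sketch, not the hardware design: the class and method names are hypothetical, and a step of one firing time per increment is assumed from the T16, T17, T18 sequence used in the examples.

```python
class NextIFTRegister:
    """Minimal model of a register that can only be loaded or incremented."""

    def __init__(self, value=0):
        self.value = value

    def increment(self):
        # Straight-line code: advance to the next firing time after a fetch.
        self.value += 1
        return self.value

    def load(self, firing_time):
        # Branch: the branch execution unit supplies the target firing time.
        self.value = firing_time
        return self.value


niftr = NextIFTRegister(16)
after_fetch = niftr.increment()   # cache control path
after_branch = niftr.load(16)     # branch execution unit path, e.g. a loop back
```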

The instruction cache selection portion 1510 receives the instructions of an execution set from memory over bus
1524 and, in a round robin manner, places the instruction words into the cache partitions 1522a, 1522b, 1522c and
1522d. In other words, the instructions in the execution set are directed so that the first instruction is delivered to
cache partition 1522a, the second instruction to cache partition 1522b, the third instruction to cache partition 1522c,
and the fourth instruction to cache partition 1522d. The fifth instruction is then directed to cache partition 1522a,
and so on until all of the instructions in the execution set are delivered into the cache partition circuits.

Not all the data delivered to the cache partitions are necessarily stored in the cache. As will be discussed, the
execution set header and trailer may not be stored. Each cache partition attaches a unique identifier (termed a tag) to
all the information that is to be stored in that cache partition. The identifier is used to verify that information
obtained from the cache is indeed the information desired.

When a packet of instructions is requested, each cache partition determines if the partition contains an instruction
that is a member of the requested packet. If none of the partitions contain an instruction that is a member of the
requested packet (that is, a miss occurs), the execution set that contains the requested packet is requested from
memory in a manner analogous to a data cache miss.

If a hit occurs (that is, at least one of the partitions 1522 contains an instruction from the requested packet), the
partition(s) attach any appropriate dynamic SCSM information to the instruction(s). The dynamic SCSM information,
which can be attached to each instruction, is important for multi-user applications. The dynamically attached SCSM
information identifies the context file (see Figure 6) assigned to a given user. Hence, under the teachings of the present
invention, the system 600 is capable of delay free switching among many user context files without requiring a master
processor or access to memory.

The instruction(s) are then delivered to the PIQ bus interface unit 1544 of the LRD 620 where they are routed to the
appropriate PIQ buffers 1560 according to the logical processor number (LPN) contained in the extended intelligence
that the TOLL software, in the illustrated embodiment, has attached to the instruction. The instructions in the PIQ buffer
unit 1560 are buffered for assignment to the actual processor elements 640. The processor assignment is performed
by the PIQ processor assignment unit 1570. The assignment of the physical processor elements is performed on the
basis of the number of processor elements currently available and the number of instructions that are available to be
assigned. These numbers are dynamic. The selection process is set forth below.

The details of the instruction cache control 1518 and of each cache partition 1522 of Figure 15 are set forth in
Figure 16. In each cache partition circuit 1522, five circuits are utilized. The first circuit is the header route circuit
1600 which routes an individual word in the header of the execution set over a path 1520b to the instruction cache
context control unit 1660. The control of the header route circuit 1600 is effected over path 1520a by the header path
select circuit 1602. The header path select circuit 1602, based upon the address received over lines 1520b from the
control unit 1660, selectively activates the required number of header routers 1600 in the cache partitions. For example,
if the execution set has two header words, only the first two header route circuits 1600 are activated by the header
path select circuit 1602 and therefore two words of header information are delivered over bus 1520b to the control unit
1660 from the two activated header route circuits 1600 of cache partition circuits 1522a and 1522b (not shown). As
mentioned, successive words in the execution set are delivered to successive cache partition circuits 1522.

For example, assume that the data of Table 1 represents an entire execution set and that appropriate header words 
appear at the beginning of the execution set. The instructions with the earliest instruction firing times (IFTs) are listed 
first and for a given IFT, those instructions with the lowest logical processor number are listed first. The table reads:

TABLE 14

Header Word 1
Header Word 2
I0(T16)(PE0)
I1(T16)(PE1)
I4(T16)(PE2)
I2(T17)(PE0)
I5(T17)(PE1)
I3(T18)(PE0)



Hence, the example of Table 1 (that is, the matrix multiply inner loop), now has associated with it two header words and
the extended information defining the firing time (IFT) and the logical processor number (LPN). As shown in Table 14,
the instructions were reordered by the TOLL software according to the firing times. Hence, as the execution set shown
in Table 14 is delivered through the LRD-memory network 630 from memory, the first word (Header Word 1) is routed
by partition CACHE0 to the control unit 1660. The second word (Header Word 2) is routed by partition CACHE1 (Fig.
15) to the control unit 1660. Instruction I0 is delivered to partition CACHE2, instruction I1 to partition CACHE3,
instruction I4 to partition CACHE0, and so forth. As a result, the cache partitions 1522 now contain the instructions as
shown in Table 15:



TABLE 15

Cache0    Cache1    Cache2    Cache3
--        --        I0        I1
I4        I2        I5        I3



It is important to clarify that the above example has only one basic block in the execution set (that is, it is a simplistic 
example). In actuality, an execution set would have a number of basic blocks. 
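
The round robin placement in the example above can be reproduced with a small Python sketch. It is a model only: the function name `distribute` and the list-based representation are assumptions, and the header words are simply diverted to a separate list to stand for their routing to the control unit 1660.

```python
NUM_PARTITIONS = 4


def distribute(words, header_words):
    """Deal execution-set words to partitions in round-robin order.

    The first `header_words` words are routed to the control unit instead of
    being stored, but they still take their turn in the rotation.
    """
    to_control_unit = []
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for position, word in enumerate(words):
        target = position % NUM_PARTITIONS
        if position < header_words:
            to_control_unit.append(word)
        else:
            partitions[target].append(word)
    return to_control_unit, partitions


execution_set = ["Header Word 1", "Header Word 2",
                 "I0", "I1", "I4", "I2", "I5", "I3"]
headers, caches = distribute(execution_set, header_words=2)
```

Here `caches` holds I4 in partition 0, I2 in partition 1, I0 and I5 in partition 2, and I1 and I3 in partition 3.
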
The instructions are then delivered for storage into a cache random access memory (RAM) 1610 resident in each
cache partition. Each instruction is delivered from the header router 1600 over a bus 1602 to the tag attacher circuit
1604 and then over a line 1606 into the RAM 1610. The tag attacher circuit 1604 is under control of a tag generation
circuit 1612 and is interconnected therewith over a line 1520c. Cache RAM 1610 could be a conventional cache high
speed RAM as found in conventional superminicomputers.

The tag generation circuit 1612 provides a unique identification code (ID) for attachment to each instruction before
storage of that instruction in the designated RAM 1610. The assigning of process identification tags to instructions
stored in cache circuits is conventional and is done to prevent aliasing of the instructions. See "Cache Memories" by
Alan J. Smith, ACM Computing Surveys, Vol. 14, September, 1982. The tag comprises a sufficient amount of information
to uniquely identify the instruction from each other instruction and user. The illustrated instructions already include
the IFT and LPN, so that subsequently, when instructions are retrieved for execution, they can be fetched based on their
firing times. As shown in Table 16, below, each instruction containing the extended information and the hardware tag is
stored, as shown, for the above example:



TABLE 16

CACHE0:   I4(T16)(PE2)(ID2)

CACHE1:   I2(T17)(PE0)(ID3)

CACHE2:   I0(T16)(PE0)(ID0)
          I5(T17)(PE1)(ID4)

CACHE3:   I1(T16)(PE1)(ID1)
          I3(T18)(PE0)(ID5)



As stated previously, the purpose of the cache partition circuits 1522 is to provide a high speed buffer between the slow
main memory 610 and the fast processor elements 640. Typically, the cache RAM 1610 is a high speed memory capable
of being quickly accessed. If the RAM 1610 were a true associative memory, as can be witnessed in Table 16, each
RAM 1610 could be addressed based upon instruction firing times (IFTs). At the present time, such associative
memories are not economically justifiable and an IFT to cache address translation circuit 1620 must be utilized. Such a
circuit is conventional in design and controls the addressing of each RAM 1610 over a bus 1520d. The purpose of circuit
1620 is to generate the RAM address of the desired instructions given the instruction firing time. Hence, for instruction
firing time T16, CACHE0, CACHE2, and CACHE3, as seen in Table 16, would produce instructions I4, I0, and I1
respectively.
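
The role of the IFT to cache address translation can be sketched in Python using the contents of Table 16. The dictionary-based lookup below merely stands in for circuit 1620; the names `build_translation` and `fetch_packet` and the use of list indices as RAM addresses are assumptions for illustration.

```python
# Contents of the four cache RAMs from the example: (instruction, IFT) in
# storage order, so the list index plays the role of a RAM address.
CACHE_RAMS = [
    [("I4", 16)],
    [("I2", 17)],
    [("I0", 16), ("I5", 17)],
    [("I1", 16), ("I3", 18)],
]


def build_translation(ram):
    """Hypothetical IFT-to-address table for one partition."""
    table = {}
    for address, (_, ift) in enumerate(ram):
        table.setdefault(ift, []).append(address)
    return table


TRANSLATION = [build_translation(ram) for ram in CACHE_RAMS]


def fetch_packet(ift):
    """Return, per partition, the instructions firing at `ift` ([] means no hit)."""
    packet = []
    for ram, table in zip(CACHE_RAMS, TRANSLATION):
        packet.append([ram[address][0] for address in table.get(ift, [])])
    return packet
```

For firing time 16, `fetch_packet(16)` yields I4, nothing, I0, and I1 from partitions 0 through 3, matching the text; a firing time with no valid instructions yields four empty lists.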

When the cache RAMs 1610 are addressed, those instructions associated with a specific firing time are delivered
over lines 1624 into a tag compare and privilege check circuit 1630. The purpose of the tag compare and privilege
check circuit 1630 is to compare the hardware tags (ID) to generated tags to verify that the proper instruction has been
delivered. The reference tag is generated through a second tag generation circuit 1632 which is interconnected to the
tag compare and privilege check circuit 1630 over a line 1520e. A privilege check is also performed on the delivered
instruction to verify that the operation requested by the instruction is permitted given the privilege status of the process
(e.g., system program, application program, etc.). This is a conventional check performed by computer processors
which support multiple levels of processing states. A hit/miss circuit 1640 determines which RAMs 1610 have delivered
the proper instructions to the PIQ bus interface unit 1544 in response to a specific instruction fetch request.

For example, and with reference back to Table 16, if the RAMs 1610 are addressed by circuit 1620 for instruction
firing time T16, CACHE0, CACHE2, and CACHE3 would respond with instructions, thereby comprising a hit indication
on those cache partitions. CACHE1 would not respond, which would constitute a miss indication, and this would be
determined by circuit 1640 over line 1520g. Thus, each instruction, for instruction firing time T16, is delivered over bus
1632 into the SCSM attacher 1650 wherein dynamic SCSM information, if any, is added to the instruction by the SCSM
attacher hardware 1650. For example, hardware 1650 can replace the static SCSM procedural level information (which
is a relative value) with the actual procedural level values. The actual values are generated from the procedural level
counter data and the static SCSM information.

When all of the instructions associated with an individual firing time have been read from the RAM 1610, the hit and
miss circuit 1640 over lines 1646 informs the instruction cache control unit 1660 of this information. The instruction
cache context control unit 1660 contains the next instruction firing time register, a part of the instruction cache control
1518 which increments the instruction firing time to the next value. Hence, in the example, upon the completion of
reading all instructions associated with instruction firing time T16, the instruction cache context control unit 1660
increments to the next firing time, T17, and delivers this information over lines 1664 to an access resolution circuit
1670, and over lines 1520f to the tag compare and privilege check circuit 1630. Also note that there may be firing times
which have no valid instructions, possibly due to operational dependencies detected by the TOLL software. In this case,
no instructions would be fetched from the cache and transmitted to the PIQ interface.

The access resolution circuit 1670 coordinates which circuitry has access to the instruction cache RAMs 1610.
Typically, these RAMs can satisfy only a single request at each clock cycle. Since there could be two requests to the
RAMs at one time, an arbitration method must be implemented to determine which circuitry obtains access. This is a
conventional issue in the design of cache memory, and the access resolution circuit resolves the priority question as is
well known in the field.

The present system can and preferably does support several users simultaneously in both time and space. In prior
art approaches (CDC, IBM, etc.), multi-user support was accomplished solely by timesharing the processors. In other
words, the processors were shared in time. In this system, multi-user support is accomplished (in space) by assigning
an LRD to each user that is given time on the processor elements. Thus, there is a spatial aspect to the sharing of the
processor elements. The operating system of the machine deals with those users assigned to the same LRD in a
timeshared manner, thereby adding the temporal dimension to the sharing of the processors.

Multi-user support is accomplished by the multiple LRDs, the use of plural processor elements, and the multiple
context files 660 supporting the register files and condition code storage. As several users may be executing in the
processor elements at the same time, additional pieces of information must be attached to each instruction prior to its
execution to uniquely identify the instruction source and any resources that it may use. For example, a register identifier
must contain the absolute value of the subroutine procedural level and the context identifier as well as the actual
register number. Memory addresses must also contain the LRD identifier from which the instruction was issued to be
properly routed through the LRD-Memory interconnection network to the appropriate data cache.

The additional and required information comprises two components, a static and a dynamic component; and the
information is termed "shared context storage mapping" (SCSM). The static information results from the compiler
output and the TOLL software gleans the information from the compiler generated instruction stream and attaches the
register information to the instruction prior to its being received by an LRD.

The dynamic information is hardware attached to the instruction by the LRD prior to its issuance to the processors.
This information is composed of the context/LRD identifier corresponding to the LRD issuing the instruction, the
absolute value of the current procedural level of the instruction, the process identifier of the current instruction stream,
and preferably the instruction status information that would normally be contained in the processors of a system having
processors that are not context free. This latter information would be composed of error masks, floating point format
modes, rounding modes, and so on.

In the operation of the circuitry in Figure 16, one or more execution sets are delivered into the instruction cache
circuitry. The header information for each set is delivered to one or more successive cache partitions and is routed to
the context control unit 1660. The instructions in the execution set are then individually, on a round robin basis, routed
to each successive cache partition unit 1522. A hardware identification tag is attached to each instruction and the
instruction is then stored in RAM 1610. As previously discussed, each execution set is of sufficient length to minimize
instruction cache defaults and the RAM 1610 is of sufficient size to store the execution sets. When the processor
elements require the instructions, the number and cache locations of the valid instructions matching the appropriate
IFTs are determined. The instructions stored in the RAM's 1610 are read out; the identification tags are verified; and the
privilege status checked. The instructions are then delivered to the PIQ bus interface unit 1544. Each instruction that is
delivered to the PIQ bus interface unit 1544, as is set forth in Table 17, includes the identification tag (ID) and the
hardware added SCSM information.

TABLE 17

CACHE0:   I4(T16)(PE2)(ID2)(SCSM0)

CACHE1:   I2(T17)(PE0)(ID3)(SCSM1)

CACHE2:   I0(T16)(PE0)(ID0)(SCSM2)
          I5(T17)(PE1)(ID4)(SCSM3)

CACHE3:   I1(T16)(PE1)(ID1)(SCSM4)
          I3(T18)(PE0)(ID5)(SCSM5)

If an instruction is not stored in RAM 1610, a cache miss occurs and a new execution set containing the instruction is
read from main memory over lines 1523.

In Figure 17, the details of the PIQ bus interface unit 1544 and the PIQ buffer unit 1560 are set forth. Referring to
Figure 17, the PIQ bus interface unit 1544 receives instructions as set forth in Table 17, above, over lines 1536. A
search tag hardware 1702 has access to the value of the present instruction firing time over lines 1549 and searches
the cache memories 1522 to determine the address(es) of those registers containing instructions having the correct
firing times. The search tag hardware 1702 then makes available to the instruction cache control circuitry 1518 the
addresses of those memory locations for determination by the instruction cache control of which instructions to next
select for delivery to the PIQ bus interface 1544.

These instructions access, in parallel, a two-dimensional array of bus interface units (BIUs) 1700. The bus interface
units 1700 are interconnected in a full access non-blocking network by means of connections 1710 and 1720, and
connect over lines 1552 to the PIQ buffer unit 1560. Each bus interface unit (BIU) 1700 is a conventional address
comparison circuit composed of TI 74L85 4 bit magnitude comparators, Texas Instruments Company, P.O. Box 225012,
Dallas, Texas 75265. In the matrix multiply example, for instruction firing time T16, CACHE0 contains instruction I4 and
CACHE3 (corresponding to CACHE n in Figure 17) contains instruction I1. The logical processor number assigned to
instruction I4 is PE2. The logical processor number PE2 activates a select (SEL) signal of the bus interface unit 1700
for processor instruction queue 2 (this is the BIU3 corresponding to the CACHE0 unit containing the instruction). In this
example, only that BIU3 is activated and the remaining bus interface units 1700 for that BIU3 row and column are not
activated. Likewise, for CACHE3 (CACHE n in Figure 17), the corresponding BIU2 is activated for processor instruction
QUEUE 1.
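
The routing of instructions to processor instruction queues by logical processor number can be sketched as follows. This Python model is an assumption-laden simplification: the function name `route` is hypothetical, the BIU array is reduced to indexing a queue by LPN, and the instruction data are taken from Table 17.

```python
from collections import deque

NUM_QUEUES = 4

# (instruction, firing time, logical processor number), as in Table 17.
INSTRUCTIONS = [
    ("I0", 16, 0), ("I1", 16, 1), ("I4", 16, 2),
    ("I2", 17, 0), ("I5", 17, 1),
    ("I3", 18, 0),
]


def route(instructions):
    """Place each instruction in the queue named by its LPN, in firing-time order."""
    queues = [deque() for _ in range(NUM_QUEUES)]
    # sorted() is stable, so instructions sharing a firing time keep their order.
    for name, _ift, lpn in sorted(instructions, key=lambda ins: ins[1]):
        queues[lpn].append(name)   # the LPN acts as the select signal
    return queues


piqs = route(INSTRUCTIONS)
```

Each queue is a FIFO, so the instruction with the earliest firing time is the first to leave its queue.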

The PIQ buffer unit 1560 is comprised of a number of processor instruction queues 1730 which store the
instructions received from the PIQ bus interface unit 1544 in a first in-first out (FIFO) fashion as shown in Table 18:



TABLE 18

PIQ0    PIQ1    PIQ2    PIQ3
I0      I1      I4      --
I2      --      --      --
I3      --      --      --









In addition to performing instruction queueing functions, the PIQ's 1730 also keep track of the execution status of
each instruction that is issued to the processor elements 640. In an ideal system, instructions could be issued to the
processor elements every clock cycle without worrying about whether or not the instructions have finished execution.
However, the processor elements 640 in the system may not be able to complete an instruction every clock cycle due
to the occurrence of exceptional conditions, such as a data cache miss and so on. As a result, each PIQ 1730 tracks
all instructions that it has issued to the processor elements 640 that are still in execution. The primary result of this
tracking is that the PIQ's 1730 perform the instruction clocking function for the LRD 620. In other words, the PIQ's 1730
determine when the next firing time register can be updated when executing straightline code. This in turn begins a new
instruction fetch cycle.

Instruction clocking is accomplished by having each PIQ 1730 form an instruction done signal that specifies that
the instruction(s) issued by a given PIQ either have executed or, in the case of pipelined PE's, have proceeded to the
next stage. This is then combined with all other PIQ instruction done signals from this LRD and is used to gate the
increment signal that increments the next firing time register. The "done" signals are delivered over lines 1564 to the
instruction cache control 1518.
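
The gating of the increment signal by the combined "done" signals can be expressed in a few lines of Python. This is a sketch under assumptions: the combination is modelled as a logical AND, the function names are hypothetical, and only straight-line code (no branch load) is considered.

```python
def may_advance(done_signals):
    """Gate for the increment: true only when every PIQ reports 'done'."""
    return all(done_signals)


def clock(next_ift, done_signals):
    """Return the next firing time after one clock, for straight-line code."""
    return next_ift + 1 if may_advance(done_signals) else next_ift


ift = 16
ift = clock(ift, [True, True, False, True])   # one PE stalled, e.g. a cache miss
stalled = ift
ift = clock(ift, [True, True, True, True])    # all issued instructions done
advanced = ift
```

While any queue still has an instruction in execution, the firing time holds its value and no new fetch cycle begins.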

Referring to Figure 18, the PIQ processor assignment circuit 1570 contains a two dimensional array of network
interface units (NIU's) 1800 interconnected as a full access switch to the PE-LRD network 650 and then to the various
processor elements 640. Each network interface unit (NIU) 1800 is comprised of the same circuitry as the bus interface
units (BIU) 1700 of Figure 17. In normal operation, the processor instruction queue #0 (PIQ0) can directly access
processor element 0 by activating the NIU0 associated with the column corresponding to queue #0, the remaining network
interface units NIU0, NIU1, NIU2, NIU3 of the PIQ processor assignment circuit for that column and row being
deactivated. Likewise, processor instruction queue #3 (PIQ3) normally accesses processor element 3 by activating the NIU3
of the column associated with queue #3, the remaining NIU0, NIU1, NIU2, and NIU3 of that column and row being
deactivated. The activation of the network interface units 1800 is under the control of an instruction select and
assignment unit 1810.

Unit 1810 receives signals from the PIQ's 1730 within the LRD that the unit 1810 is a member of over lines 1811,
from all other units 1810 (of other LRD's) over lines 1813, and from the processor elements 640 through the network
650. Each PIQ 1730 furnishes the unit 1810 with a signal that corresponds to "I have an instruction that is ready to be
assigned to a processor." The other PIQ buffer units furnish this unit 1810 and every other unit 1810 with a signal that
corresponds to "My PIQ 1730 (#x) has an instruction ready to be assigned to a processor." Finally, the processor
elements furnish each unit 1810 in the system with a signal that corresponds to "I can accept a new instruction."

The unit 1810 on an LRD transmits signals to the PIQs 1730 of its LRD over lines 1811, to the network interface
units 1800 of its LRD over lines 1860 and to the other units 1810 of the other LRDs in the system over lines 1813. The
unit 1810 transmits a signal to each PIQ 1730 that corresponds to "Gate your instruction onto the PE-LRD interface bus
(650)." The unit transmits a select signal to the network interface units 1800. Finally, the unit 1810 transmits a signal
that corresponds to "I have used processor element #x" to each other unit 1810 in the system for each processor which
it is using.

In addition, each unit 1810 in each LRD has associated with it a priority that corresponds to the priority of the LRD.
This is used to order the LRDs into an ascending order from zero to the number of LRDs in the system. The method 
used for assigning the processor elements is as follows. Given that the LRDs are ordered, many allocation schemes 
are possible (e.g., round robin, first come first served, time slice, etc.). However, these are implementation details and 
do not impact the functionality of this unit under the teachings of the present invention. 

Consider the LRD with the current highest priority. This LRD gets all the processor elements that it requires and
assigns the instructions that are ready to be executed to the available processor elements. If the processor elements 
are context free, the processor elements can be assigned in any manner whatsoever. Typically, however, assuming that 
all processors are functioning correctly, instructions from PIQ #0 are routed to processor element #0, provided of 
course, processor element #0 is available. 

The unit 1810 in the highest priority LRD then transmits this information to all other units 1810 in the system. Any
processors left open are then utilized by the next highest priority LRD with instructions that can be executed. This allo- 
cation continues until all processors have been assigned. Hence, processors may be assigned on a priority basis in a 
daisy chained manner. 
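
The daisy-chained allocation described above can be sketched in Python as follows. This is a sketch under stated assumptions, not the circuit of unit 1810: the function name, the data layout, and the convention that list index 0 is the highest priority LRD are all hypothetical, and the fallback to the lowest numbered free element is just one of the many allocation schemes the text allows.

```python
def allocate(requests, num_pes):
    """Assign processor elements to LRDs in priority order (daisy chained).

    requests: list indexed by LRD priority (index 0 assumed highest); each
              entry lists the PIQ numbers that have an instruction ready.
    Returns ({lrd: {piq: pe}}, set of processor elements still free).
    """
    free = set(range(num_pes))
    grants = {}
    for lrd, piqs in enumerate(requests):
        grants[lrd] = {}
        for piq in piqs:
            if not free:
                break  # all processors have been assigned
            # Prefer the like-numbered element (PIQ #0 -> PE #0) when available;
            # otherwise take any open element (an assumed tie-break rule).
            pe = piq if piq in free else min(free)
            free.remove(pe)
            grants[lrd][piq] = pe
    return grants, free
```

For example, if the highest priority LRD has instructions ready in PIQ #0 and PIQ #2, it takes elements 0 and 2, and the next LRD is left to use elements 1 and 3.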

If a particular processor element, for example, element 1, has failed, the instruction select and assignment unit 1810
can deactivate that processor element by deactivating all network interface units NIU1. It can then, through hardware,
reorder the processor elements so that, for example, processor element 2 receives all instructions logically assigned to
processor element 1, processor element 3 is now assigned to receive all instructions logically assigned to processor 2,
etc. Indeed, redundant processor elements and network interface units can be provided to the system to provide for a
high degree of fault tolerance.

Clearly, this is but one possible implementation. Other methods are also realizable.
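
The reordering around a failed element can be illustrated with a short Python sketch. The function name and the boolean health list are assumptions made for illustration; the text states only that the reordering is done in hardware.

```python
def remap(num_logical, physical_ok):
    """Map logical processor element numbers onto working physical elements.

    physical_ok: hypothetical list of booleans, True if that element works.
    Failed elements are skipped, so later logical numbers shift upward.
    """
    working = [pe for pe, ok in enumerate(physical_ok) if ok]
    if len(working) < num_logical:
        raise ValueError("not enough working processor elements")
    return {logical: working[logical] for logical in range(num_logical)}
```

With five physical elements (one redundant) and element 1 failed, logical element 1 maps to physical element 2, logical 2 to physical 3, and logical 3 to the spare element 4.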

b. Branch Execution Unit (BEU) 

Referring to Figure 19, the Branch Execution Unit (BEU) 1548 is the unit in the present invention responsible for the
execution of all branch instructions which occur at the end of each basic block. There is, in the illustrated embodiment,
one BEU 1548 for each supported context and so, with reference to Figure 6, "n" supported contexts require "n" BEU's.
The illustrated embodiment uses one BEU for each supported context because each BEU 1548 is of simple design and,
therefore, sharing a BEU between plural contexts would be more expensive than allowing each context to
have its own BEU.

The BEU 1548 executes branches in a conventional manner with the exception that the branch instructions are
executed outside the PE's 640. The BEU 1548 evaluates the branch condition and, when the target address is selected,
generates and places this address directly into the next instruction fetch register. The target address generation is
conventional for unconditional and conditional branches that are not procedure calls or returns. The target address can be
(a) taken directly from the instruction, (b) an offset from the current contents of the next instruction fetch register, or (c)
an offset of a general purpose register of the context register file.
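
The three target address sources (a) to (c) can be summarized in a small Python sketch. The mode names, the parameter names, and the list used for the register file are hypothetical and are chosen only to mirror the three cases in the text.

```python
def branch_target(mode, operand, next_fetch=0, registers=None, reg=0):
    """Form a branch target address for the three cases described in the text."""
    if mode == "direct":      # (a) address taken directly from the instruction
        return operand
    if mode == "relative":    # (b) offset from the next instruction fetch register
        return next_fetch + operand
    if mode == "register":    # (c) offset of a general purpose register
        return registers[reg] + operand
    raise ValueError(f"unknown addressing mode: {mode}")
```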

A return branch from a subroutine is handled in a slightly different fashion. To understand the subroutine return
branch, discussion of the subroutine call branch is required. When the subroutine call branch is executed, a return address
is created and stored. The return address is normally the address of the instruction following the subroutine call. The return
address can be stored in a stack in memory or in other storage local to the branch execution unit. In addition, the
execution of the subroutine call increments the procedural level counter.

The return from a subroutine branch is also an unconditional branch. However, rather than containing the target
address within the instruction, this type of branch reads the previously stored return address from storage, decrements
the procedural level counter, and loads the next instruction fetch register with the return address. The remainder of the
disclosure discusses the evaluation and execution of conditional branches. It should be noted that the techniques
described also apply to unconditional branches, since these are, in effect, conditional branches in which the condition
is always satisfied. Further, these same techniques also apply to the subroutine call and return branches, which perform
the additional functions described above.
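
The additional functions performed by the call and return branches can be sketched in Python as below. The class name, the use of a Python list as the return address storage, and the instruction length of one address unit are assumptions for illustration; the text leaves the storage location open (a stack in memory or storage local to the BEU).

```python
class BranchState:
    """Hypothetical model of the state touched by call and return branches."""

    def __init__(self):
        self.next_fetch = 0       # next instruction fetch register
        self.level = 0            # procedural level counter
        self.return_stack = []    # stored return addresses

    def call(self, call_address, target, instr_len=1):
        # The return address is normally the instruction following the call.
        self.return_stack.append(call_address + instr_len)
        self.level += 1
        self.next_fetch = target

    def ret(self):
        # Read the stored return address and drop one procedural level.
        self.level -= 1
        self.next_fetch = self.return_stack.pop()
```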

To speed up conditional branches, the determination of whether a conditional branch is taken or not depends
solely on the analysis of the appropriate set of condition codes. Under the teachings of the present invention, no
evaluation of data is performed other than to manipulate the condition codes appropriately. In addition, an instruction, which
generates a condition code that a branch will use, can transmit the code to BEU 1548 as well as to the condition code
storage. This eliminates the conventional extra waiting time required for the code to become valid in the condition code
storage prior to a BEU being able to fetch it.

The present system also makes extensive use of delayed branching to guarantee program correctness. When a
branch has executed and its effects are being propagated in the system, all instructions that are within the procedural
domain of the branch must either have been executed or be in the process of being executed, as discussed in
connection with the example of Table 6. In other words, changing the next-instruction pointer (in response to the branch) takes
place after the current firing time has been updated to point to the firing time that follows the last (temporally executed)
instruction of the branch. Hence, in the example of Table 6, instruction I5 at firing time T17 is delayed until the
completion of T18, which is the last firing time for this basic block. The instruction time for the next basic block is then T19.

The functionality of the BEU 1548 can be described as a four-state state machine:

Stage 1: Instruction decode
    Operation decode
    Delay field decode
    Condition code access decode
Stage 2: Condition code fetch/receive
Stage 3: Branch operation evaluation
Stage 4: Next instruction fetch location and firing time update

Along with determining the operation to be performed, the first stage also determines how long fetching can continue
to take place after receipt of the branch by the BEU, and how the BEU is to access the condition codes for a conditional
branch, that is, are they received or fetched.

Referring to Figure 19, the branch instruction is delivered over bus 1546 from the PIQ bus interface unit 1544 into
the instruction register 1900 of the BEU 1548. The fields of the instruction register 1900 are designated as:
FETCH/ENABLE, CONDITION CODE ADDRESS, OP CODE, DELAY FIELD, and TARGET ADDRESS. The instruction
register 1900 is connected over lines 1910a and 1910b to a condition code access unit 1920, over lines 1910c to an
evaluation unit 1930, over lines 1910d to a delay unit 1940, and over lines 1910e to a next instruction interface 1950.

Once an instruction has been issued to BEU 1548 from the PIQ bus interface 1544, instruction fetching must be
held up until the value in the delay field has been determined. This value is measured relative to the receipt of the
branch by the BEU, that is, stage 1. If there are no instructions that may be overlapped with this branch, this field value
is zero. In this case, instruction fetching is held up until the outcome of the branch has been determined. If this field is
non-zero, instruction fetching may continue for a number of firing times given by the value in this field.

The condition code access unit 1920 is connected to the register file - PE network 670 over lines 1550 and to the
evaluation unit 1930 over lines 1922. During stage 2 operation, the condition code access decode unit 1920 determines
whether or not the condition codes must be fetched by the instruction, or whether the instruction that determines the
branch condition delivers them. As there is only one instruction per basic block that will determine the conditional
branch, there will never be more than one condition code received by the BEU for a basic block. As a result, the actual
timing of when the condition code is received is not important. If it comes earlier than the branch, no other codes will be
received prior to the execution of the branch. If it comes later, the branch will be waiting and the codes received will
always be the right ones. Note that the condition code for the basic block can include plural codes received at the same
or different times by the BEU.

The evaluation unit 1930 is connected to the next instruction interface 1950 over lines 1932. The next instruction 
interface 1950 is connected to the instruction cache control circuit 1518 over lines 1549 and to the delay unit 1940 over 
lines 1942; and the delay unit 1940 is also connected to the instruction cache control unit 1518 over lines 1549. 

During the evaluation stage of operation, the condition codes are combined according to a Boolean function that
represents the condition being tested. In the final stage of operation, either fetching of the sequential instruction stream 
continues, if a conditional branch is not taken, or the next instruction pointer is loaded, if the branch is taken. 
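
The evaluation and final stages can be pictured with a brief Python sketch. The opcode mnemonics (BZ, BNZ, BN, BZN, BRA) and the two condition code bits Z and N are purely hypothetical; the specification does not name the opcodes or the individual codes, only that the opcode selects a Boolean function of the received codes.

```python
# Hypothetical opcodes, each selecting a Boolean function of the condition codes.
CONDITIONS = {
    "BZ":  lambda cc: cc["Z"],              # branch if zero
    "BNZ": lambda cc: not cc["Z"],          # branch if not zero
    "BN":  lambda cc: cc["N"],              # branch if negative
    "BZN": lambda cc: cc["Z"] or cc["N"],   # branch if zero or negative
    "BRA": lambda cc: True,                 # unconditional: always satisfied
}

def next_fetch_address(opcode, cc, target, sequential):
    """Stage 3 evaluates the condition; stage 4 selects the next fetch address."""
    taken = CONDITIONS[opcode](cc)
    return target if taken else sequential
```

Treating the unconditional branch as a condition that is always true follows the remark in the text that unconditional branches are, in effect, conditional branches whose condition is always satisfied.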

The impact of a branch in the instruction stream can be described as follows. Instructions, as discussed, are sent
to their respective PIQ's 1730 by analysis of the resident logical processor number (LPN). Instruction fetching can be
continued until a branch is encountered, that is, until an instruction is delivered to the instruction register 1900 of the
BEU 1548. At this point, in a conventional system without delayed branching, fetching would be stopped until the
resolution of the branch instruction is complete. See, for example, "Branch Prediction Strategies and Branch Target Buffer
Design", J.F.K. Lee & A.J. Smith, IEEE Computer Magazine, January, 1984.

In the present system, which includes delayed branching, instructions must continue to be fetched until the next
instruction fetched is the last instruction of the basic block to be executed. The time that the branch is executed is then
the last time that fetching of an instruction can take place without a possibility of modifying the next instruction address.
Thus, the difference between when the branch is fetched and when the effects of the branch are actually felt
corresponds to the number of additional firing time cycles during which fetching can be continued.

The impact of this delay is that the BEU 1548 must have access to the next instruction firing time register of the 
cache controller 1518. Further, the BEU 1548 can control the initiation or disabling of the instruction fetch process per- 
formed by the instruction cache control unit 1518. These tasks are accomplished by signals over bus 1549. 

In operation the branch execution unit (BEU) 1548 functions as follows. The branch instruction, such as instruction
I5 in the example above, is loaded into the instruction register 1900 from the PIQ bus interface unit 1544. The contents
of the instruction register then control the further operation of BEU 1548. The FETCH-ENABLE field indicates whether
or not the condition code access unit 1920 should retrieve the condition code located at the address stored in the
CC-ADX field (called FETCH) or whether the condition code will be delivered by the generating instruction.

If a FETCH is requested, the unit 1920 accesses the register file-PE network 670 (see Figure 6) to access the
condition code storage 2000 which is shown in Figure 20. Referring to Figure 20, the condition code storage 2000, for each
context file, is shown in the generalized case. A set of registers CC xy are provided for storing condition codes for
procedural level y. Hence, the condition code storage 2000 is accessed and addressed by the unit 1920 to retrieve,
pursuant to a FETCH request, the necessary condition code. The actual condition code and an indication that the condition
code is received by the unit 1920 are delivered over lines 1922 to the evaluation unit 1930. The OPCODE field, delivered
to the evaluation unit 1930, in conjunction with the received condition code, functions to deliver a branch taken signal
over line 1932 to the next instruction interface 1950. The evaluation unit 1930 is comprised of standard gate arrays such
as those from LSI Logic Corporation, 1551 McCarthy Blvd., Milpitas, California 95035.

The evaluation unit 1930 accepts the condition code set that determines whether or not the conditional branch is
taken, and under control of the OPCODE field, combines the set in a Boolean function to generate the conditional
branch taken signal.

The next instruction interface 1950 receives the branch target address from the TARGET-ADX field of the
instruction register 1900 and the branch taken signal over line 1932. However, the interface 1950 cannot operate until an
enable signal is received from the delay unit 1940 over lines 1942.

The delay unit 1940 determines the amount of time that instruction fetching can be continued after the receipt of a
branch instruction by the BEU. Previously, it has been described that when a branch instruction is received by the BEU,
instruction fetching continues for one more cycle and then stops. The instruction fetched during this cycle is held up
from passing through PIQ bus interface unit 1544 until the length of the delay field has been determined. For example,
if the delay field is zero (implying that the branch is to be executed immediately), these instructions must still be withheld
from the PIQ bus buffer unit until it is determined whether or not these are the right instructions to be fetched. If the delay
field is non-zero, the instructions would be gated into the PIQ buffer unit as soon as the delay value was determined to
be non-zero. The length of the delay is obtained from the DELAY field of the instruction register 1900. The delay unit
receives the delay length from register 1900 and clock impulses from the context control 1518 over lines 1549. The
delay unit 1940 decrements the value of the delay at each clock pulse; and when fully decremented, the interface unit
1950 becomes enabled.

Hence, in the discussion of Table 6, instruction I5 is assigned a firing time T17 but is delayed until firing time T18.
During the delay time, the interface 1950 signals the instruction cache control 1518 over line 1549 to continue to fetch 
instructions to finish the current basic block. When enabled, the interface unit 1950 delivers the next address (that is, 
the branch execution address) for the next basic block into the instruction cache control 1518 over lines 1549. 

In summary and for the example on Table 6, the branch instruction I5 is loaded into the instruction register 1900
during time T17. However, a delay of one firing time (DELAY) is also loaded into the instruction register 1900 as the
branch instruction cannot be executed until the last instruction I3 is processed during time T18. Hence, even though the
instruction I5 is loaded in register 1900, the branch address for the next basic block, which is contained in the TARGET
ADDRESS, does not become effective until the completion of time T18. In the meantime, the next instruction interface
1950 issues instructions to the cache control 1518 to continue processing the stream of instructions in the basic block.
Upon the expiration of the delay, the interface 1950 is enabled, and the branch is executed by delivering the address of
the next basic block to the instruction cache control 1518.
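
The Table 6 timing summarized above can be traced with a small Python sketch of the delay countdown. This is a simplified model under stated assumptions: the function name, the event strings, and the target address 0x40 used below are hypothetical, and the hardware handshakes between units 1940, 1950, and 1518 are reduced to one decrement per firing time.

```python
def branch_timeline(receive_time, delay, target, taken=True):
    """Return (firing time, event) pairs for a delayed branch.

    The delay counter is decremented once per firing time; the branch
    outcome takes effect at the firing time after the counter reaches zero.
    """
    events = [(receive_time, "branch loaded into instruction register")]
    t = receive_time
    remaining = delay
    while remaining > 0:
        t += 1
        remaining -= 1
        events.append((t, "fetch continues in current basic block"))
    if taken:
        events.append((t + 1, f"next basic block starts at target {target:#x}"))
    else:
        events.append((t + 1, "sequential fetch continues"))
    return events
```

Calling `branch_timeline(17, 1, 0x40)` gives the branch loaded at T17, fetching continued through T18, and the next basic block starting at T19, which matches the sequence described for instruction I5.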

Note that the delay field is used to guarantee the execution of all instructions in the basic block governed by this
branch in single cycle context free PE's. A small complexity is encountered when the PE's are pipelined. In this case,
there exist data dependencies between the instructions from the basic block just executed, and the instructions from
the basic block to be executed. The TOLL software can analyze these dependencies when the next basic block is only
targeted by the branch from this basic block. If the next basic block is targeted by more than one branch, the TOLL
software cannot resolve the various branch possibilities and lets the pipelines drain, so that no data dependencies are
violated. One mechanism for allowing the pipelines to drain is to insert NO-OP (no operation) instructions into the
instruction stream. An alternate method provides an extra field in the branch instruction which inhibits the delivery of
new instructions to the processor elements for a time determined by the data in the extra field.

c. Processor Elements (PE) 

So far in the discussions pertaining to the matrix multiply example, a single cycle processor element has been
assumed. In other words, an instruction is issued to the processor element and the processor element completely
executes the instruction before proceeding to the next instruction. However, greater performance can be obtained by
employing pipelined processor elements. Accordingly, the tasks performed by the TOLL software change slightly. In
particular, the assignment of the processor elements is more complex than is shown in the previous example; and the
hazards that characterize a pipeline processor must be handled by the TOLL software. The hazards that are present in
any pipelined processor manifest themselves as a more sophisticated set of data dependencies. This can be encoded
into the TOLL software by one practiced in the art. See for example, T.K.R. Gross, Stanford University, 1983, "Code
Optimization of Pipeline Constraints", Doctorate Dissertation Thesis.

The assignment of the processors is dependent on the implementation of the pipelines and again, can be
performed by one practiced in the art. A key parameter is determining how data is exchanged between the pipelines. For
example, assume that each pipeline contains feedback paths between its stages. In addition, assume that the pipelines
can exchange results only through the register sets 660. Instructions would be assigned to the pipelines by determining
sets of dependent instructions that are contained in the instruction stream and then assigning each specific set to a
specific pipeline. This minimizes the amount of communication that must take place between the pipelines (via the register
set), and hence speeds up the execution time of the program. The use of the logical processor number guarantees that
the instructions will execute on the same pipeline.

Alternatively, if there are paths available to exchange data between the pipelines, dependent instructions may be
distributed across several pipeline processors instead of being assigned to a single pipeline. Again, the use of multiple
pipelines and the interconnection network between them that allows the sharing of intermediate results manifests itself
as a more sophisticated set of data dependencies imposed on the instruction stream. Clearly, the extension of the
teachings of this specification to a pipelined system is within the skill of one practiced in the art.

Importantly, the additional data (chaining) paths do not change the fundamental context free nature of the
processor elements of the present invention. That is, at any given time (for example, the completion of any given instruction
cycle), the entire process state associated with a given program (that is, context) is captured completely external to the
processor elements. Data chaining results merely in a transitory replication of some of the data generated within the
processor elements during a specific instruction clock cycle.

Referring to Figure 21, a particular processor element 640 is a four-stage pipeline processor element. All processor
elements 640 according to the illustrated embodiment are identical. It is to be expressly understood that any prior art
type of processor element, such as a micro-processor or other pipeline architecture, could not be used under the
teachings of the present description, because such processors retain substantial state information of the program they
are processing. However, such a processor could be programmed with software to emulate or simulate the type of
processor necessary for the present system. The design of the processor element is determined by the instruction set
architecture generated by the TOLL software and, therefore, from a conceptual viewpoint, is the most implementation
dependent portion of this system. In the illustrated embodiment shown in Figure 21, each processor element pipeline
operates autonomously of the other processor elements in the system. Each processor element is homogeneous and
is capable, by itself, of executing all computational and data memory accessing instructions. In making computational
executions, transfers are from register to register and, for memory interface instructions, the transfers are from memory
to registers or from registers to memory.

Referring to Figure 21, the four-stage pipeline for the processor element 640 of the illustrated embodiment includes
four discrete instruction registers 2100, 2110, 2120, and 2130. Each processor element also includes four stages:
stage 1, 2140; stage 2, 2150; stage 3, 2160; and stage 4, 2170. The first instruction register 2100 is connected through
the network 650 to the PIQ processor assignment circuit 1570 and receives that information over bus 2102. The
instruction register 2100 then controls the operation of stage 1 which includes the hardware functions of instruction
decode and register 0 fetch and register 1 fetch. The first stage 2140 is interconnected to the instruction register over
lines 2104 and to the second instruction register 2110 over lines 2142. The first stage 2140 is also connected over a
bus 2144 to the second stage 2150. Register 0 fetch and register 1 fetch of stage 1 are connected over lines 2146 and
2148, respectively, to network 670 for access to the register file 660.

The second instruction register 2110 is further interconnected to the third instruction register 2120 over lines 2112
and to the second stage 2150 over lines 2114. The second stage 2150 is also connected over a bus 2152 to the third
stage 2160 and further has the memory write (MEM WRITE) register fetch hardware interconnected over lines 2154 to
network 670 for access to the register file 660 and its condition code (CC) hardware connected over lines 2156 through
network 670 to the condition code storage of context file 660.

The third instruction register 2120 is interconnected over lines 2122 to the fourth instruction register 2130 and is
also connected over lines 2124 to the third stage 2160. The third stage 2160 is connected over a bus 2162 to the fourth
stage 2170 and is further interconnected over lines 2164 through network 650 to the data cache interconnection
network 1590.

Finally, the fourth instruction register 2130 is interconnected over lines 2132 to the fourth stage, and the fourth
stage has its store hardware (STORE) output connected over lines 2172 and its effective address update (EFR ADD.)
hardware circuit connected over lines 2174 to network 670 for access to the register file 660. In addition, the fourth
stage has its condition code store (CC STORE) hardware connected over lines 2176 through network 670 to the
condition code storage of context file 660.

The operation of the four-stage pipeline shown in Figure 21 will now be discussed with respect to the example of
Table 1 and the information contained in Table 19 which describes the operation of the processor element for each
instruction.



TABLE 19

Instruction I0, (I1):
    Stage 1    Fetch Reg to form Mem-adx
    Stage 2    Form Mem-adx
    Stage 3    Perform Memory Read
    Stage 4    Store R0, (R1)

Instruction I2:
    Stage 1    Fetch Reg R0 and R1
    Stage 2    No-Op
    Stage 3    Perform multiply
    Stage 4    Store R2 and CC

Instruction I3:
    Stage 1    Fetch Reg R2 and R3
    Stage 2    No-Op
    Stage 3    Perform addition
    Stage 4    Store R3 and CC

Instruction I4:
    Stage 1    Fetch Reg R4
    Stage 2    No-Op
    Stage 3    Perform decrement
    Stage 4    Store R4 and CC



For instructions I0 and I1, the performance by the processor element 640 in Figure 21 is the same except in stage
4. The first stage is to fetch the memory address from the register which contains the address in the register file. Hence,
stage 1 interconnects circuitry 2140 over lines 2146 through network 670 to that register and downloads it into register
0 from the interface of stage 1. Next, the address is delivered over bus 2144 to stage 2, and the memory write hardware
forms the memory address. The memory address is then delivered over bus 2152 to the third stage which reads
memory over lines 2164 through network 650 to the data cache interconnection network 1590. The results of the read
operation are then stored and delivered to stage 4 for storage in register R0. Stage 4 delivers the data over lines 2172 through
network 670 to register R0 in the register file. The same operation takes place for instruction I1 except that the results
are stored in register R1. Hence, the four stages of the pipeline (Fetch, Form Memory Address, Perform Memory Read,
and Store The Results) flow data through the pipe in the manner discussed, and when instruction I0 has passed
through stage 1, the first stage of instruction I1 commences. This overlapping or pipelining is conventional in the art.

Instruction I2 fetches the information stored in registers R0 and R1 in the register file 660 and delivers them into
registers REG0 and REG1 of stage 1. The contents are delivered over bus 2144 through stage 2 as a no operation and
then over bus 2152 into stage 3. A multiply occurs with the contents of the two registers, the results are delivered over
bus 2162 into stage 4 which then stores the results over lines 2172 through network 670 into register R2 of the register
file 660. In addition, the condition code data is stored over lines 2176 in the condition code storage of context files 660.

Instruction I3 performs the addition of the data in registers R2 and R3 in the same fashion, to store the results, at
stage 4, in register R3 and to update the condition code data for that instruction. Finally, instruction I4 operates in the
same fashion except that stage 3 performs a decrement of the contents of register R4.

Hence, according to the example of Table 1, the instructions for PE0 would be delivered from the PIQ0 in the
following order: I0, I2, and I3. These instructions would be sent through the PE0 pipeline stages (S1, S2, S3, and S4),
based upon the instruction firing times (T16, T17, and T18), as follows:

TABLE 20

PE      Inst    T16    T17    T18    T19    T20    T21
PE0:    I0      S1     S2     S3     S4
        I2             S1     S2     S3     S4
        I3                    S1     S2     S3     S4
PE1:    I1      S1     S2     S3     S4
PE2:    I4      S1     S2     S3     S4


The schedule illustrated in Table 20 is not, however, possible unless data chaining is introduced within the pipeline
processor (intraprocessor data chaining) as well as between pipeline processors (interprocessor data chaining). The
requirement for data chaining occurs because an instruction no longer completely executes within a single time cycle
illustrated by, for example, instruction firing time T16. Thus, for a pipeline processor, the TOLL software must recognize
that the results of the store which occurs at stage 4 (T19) of instructions I0 and I1 are needed to perform the multiply at
stage 3 (T19) of instruction I2, and that fetching of those operands normally takes place at stage 1 (T17) of instruction
I2. Accordingly, in the normal operation of the pipeline, for processors PE0 and PE1, the operand data from registers
R0 and R1 is not available until the end of firing time T18 while it is needed by stage 1 of instruction I2 at time T17.

To operate according to the schedule illustrated in Table 20, additional data (chaining) paths must be made available
to the processors, paths which exist both internal to the processors and between processors. These paths, well
known to those practiced in the art, are the data chaining paths. They are represented, in Figure 21, as dashed lines
2180 and 2182. Accordingly, the resolution of data dependencies between instructions and all scheduling of
processor resources, which are performed by the TOLL software prior to program execution, take into account the
availability of data chaining when needed to make available data directly from the output, for example, of one stage of the
same processor or a stage of a different processor. This data chaining capability is well known to those practiced in the
art and can be implemented easily in the TOLL software analysis by recognizing each stage of the pipeline processor
as being, in effect, a separate processor having resource requirements and certain dependencies, that is, that an
instruction when started through a pipeline will preferably continue in that same pipeline through all of its processing
stages. With this in mind, the speed up in processing can be observed in Table 20 where the three machine cycle times
for the basic block are completed in a time of only six pipeline cycles. It should be borne in mind that the cycle time for
a pipeline is approximately one-fourth the cycle time for the non-pipeline processor in the illustrated embodiment of the
invention.
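
By way of illustration only, the timing constraint that data chaining relieves can be sketched in Python. The sketch assumes a four-stage pipeline in which an instruction fired at time T occupies stage s during cycle T + s - 1, that a result is stored at stage 4 and consumed by the ALU at stage 3, and that, without chaining, a register becomes readable only in the cycle after the store. The names `stage_time` and `earliest_firing` are hypothetical and form no part of the TOLL software.

```python
STAGES = 4  # assumed four-stage pipeline, as in Figure 21


def stage_time(firing_time: int, stage: int) -> int:
    """Cycle in which an instruction fired at firing_time occupies a stage (1-based)."""
    if not 1 <= stage <= STAGES:
        raise ValueError("stage must be between 1 and 4")
    return firing_time + stage - 1


def earliest_firing(producer_firing: int, chaining: bool) -> int:
    """Earliest firing time of an instruction that consumes the producer's result."""
    result_ready = stage_time(producer_firing, 4)  # producer's store stage
    if chaining:
        # Assumption: the stage-4 output is forwarded directly to the
        # consumer's stage 3 (ALU) within the same cycle.
        return result_ready - (3 - 1)
    # Assumption: without chaining, the consumer's stage-1 register fetch
    # must wait until the cycle after the store.
    return result_ready + 1


print(earliest_firing(16, chaining=True))
print(earliest_firing(16, chaining=False))
```

Under these assumptions, a producer fired at T16 permits the dependent instruction to fire at T17 when chaining is available, its stage 3 coinciding with the producer's stage 4 at T19, and at T20 otherwise.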

The pipeline of Figure 21 is composed of four equal (temporal) length stages. The first stage 2140 performs the
instruction decode, determines what registers to fetch and store, and performs up to two source register fetches which
can be required for the execution of the instruction.

The second stage 2150 is used by the computational instructions for the condition code fetch if required. It is also 
the effective address generation stage for the memory interface instructions. 

The effective address operations that are supported in the preferred embodiment of the invention are: 

1. Absolute address

The full memory address is contained in the instruction.

2. Register indirect

The full memory address is contained in a register.

3. Register indexed/based

The full memory address is formed by combining the designated registers and immediate data.

a. Rn op K

b. Rn op Rm

c. Rn op K op Rm

d. Rn op Rm op K

where "op" can be addition (+), subtraction (-), or multiplication (*) and "K" is a constant.

As an example, the addressing constructs presented in the matrix multiply inner loop example are formed from 
case 3-a where the constant "K" is the length of a data element within the array and the operation is addition (+). 
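
The addressing constructs listed above can be sketched, purely as an illustration, in Python. The sketch assumes that the terms of cases 3-a to 3-d are applied strictly from left to right, an ordering which the description does not state. The function name `effective_address` and the register contents are hypothetical.

```python
import operator
from typing import Mapping, Sequence, Tuple, Union

# The three operations named in the text for "op".
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

Operand = Union[str, int]  # a register name (Rm) or a constant (K)


def effective_address(regs: Mapping[str, int], base: str,
                      terms: Sequence[Tuple[str, Operand]] = ()) -> int:
    """Combine base register Rn with zero, one or two (op, operand) terms.

    Assumption: terms are applied strictly left to right.
    """
    if len(terms) > 2:
        raise ValueError("at most two terms (cases 3-a to 3-d)")
    addr = regs[base]
    for op, operand in terms:
        # A string names a register; an integer is the constant K.
        value = regs[operand] if isinstance(operand, str) else operand
        addr = OPS[op](addr, value)
    return addr


regs = {"R10": 1000, "R11": 8}  # hypothetical register contents
print(effective_address(regs, "R10"))                            # register indirect
print(effective_address(regs, "R10", [("+", 4)]))                # case 3-a, K = 4
print(effective_address(regs, "R10", [("+", "R11"), ("-", 4)]))  # case 3-d
```

The second call corresponds to case 3-a with addition, where the constant stands for an assumed data element length of four.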
At a conceptual level, the effective addressing portion of a memory access instruction is composed of three basic
functions: the designation and procurement of the registers and immediate data needed for the calculation, the
combination of these operands in order to form the desired address, and, if necessary, updating of any one of the registers
involved. This functionality is common in the prior art and is illustrated by the autoincrement and autodecrement modes
of addressing available in the DEC processor architecture. See, for example, DEC VAX Architecture Handbook.

Aside from the obvious hardware support required, the effective addressing is supported by the TOLL software,
and impacts the TOLL software by adding functionality to the memory accessing instructions. In other words, an
effective address memory access can be interpreted as a concatenation of two operations, the first being the effective
address calculation and the second being the actual memory access. This functionality can be easily encoded into the
TOLL software by one skilled in the art in much the same manner as an add, subtract or multiply instruction would be.

The described effective addressing constructs are to be interpreted as but one possible embodiment of a memory
accessing system. There are a plethora of other methods and modes for generating a memory address that are known
to those skilled in the art. In other words, the effective addressing constructs described above are for design
completeness only, and are not to be construed as a key element in the design of the system.
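
The interpretation of an effective address memory access as a concatenation of two operations can be sketched as follows. This is a minimal illustration in Python and not the TOLL software itself: the type `Operation`, the functions `split_load` and `depends_on`, and the temporary resource "EA" are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Operation:
    kind: str
    reads: Tuple[str, ...]   # resources read by the operation
    writes: Tuple[str, ...]  # resources written by the operation


def split_load(dest: str, base: str, update_base: bool = False) -> List[Operation]:
    """Split a load through an effective address into two operations.

    "EA" is a hypothetical temporary holding the calculated address.
    When update_base is set, the calculation also writes the base register,
    in the manner of an autoincrement or autodecrement mode.
    """
    calc_writes = ("EA", base) if update_base else ("EA",)
    calc = Operation("address_calc", reads=(base,), writes=calc_writes)
    access = Operation("memory_read", reads=("EA",), writes=(dest,))
    return [calc, access]


def depends_on(later: Operation, earlier: Operation) -> bool:
    """True if 'later' reads a resource that 'earlier' writes."""
    return any(r in earlier.writes for r in later.reads)


calc, access = split_load("R0", "R10", update_base=True)
print(depends_on(access, calc))
```

Treating the address calculation like any other arithmetic operation lets a dependency analysis see both the temporary address and any register update as ordinary resource writes.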

Referring to Figure 22, various structures of data or data fields within the pipeline processor element of Figure 21
are illustrated for a system which is a multi-user system in both time and space. As a result, across the multiple
pipelines, instructions from different users may be executing, each with its own processor state. As the processor state is
not typically associated with the processor element, the instruction must carry along the identifiers that specify this
state. This processor state is supported by the LRD, register file and condition code file assigned to the user.

A sufficient amount of information must be associated with each instruction so that each memory access, condition
code access or register access can uniquely identify the target of the access. In the case of the registers and condition
codes, this additional information constitutes the absolute value of the procedural level (PL) and context identifiers (CI)
and is attached to the instruction by the SCSM attachment unit 1650. This is illustrated in Figures 22a, 22b and 22c
respectively. The context identifier portion is used to determine which register or condition code plane (Fig. 6) is being
accessed. The procedural level is used to determine which procedural level of registers (Fig. 13) is to be accessed.
Memory accesses also require that the LRD that supports the current user be identified so that the appropriate data
cache can be accessed. This is accomplished through the context identifier. The data cache access further requires
that a process identifier (PID) for the current user be available to verify that the data present in the cache is indeed the
data desired. Thus, an address issued to the data cache takes the form of Figure 22d. The miscellaneous field is
composed of additional information describing the access, for example, read or write, user or system, etc.
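
A data cache access of the kind just described can be sketched in Python, under stated assumptions. The field widths and layout of Figure 22d are not reproduced; the sketch simply maps the context identifier directly to a data cache (omitting the intermediate LRD) and uses the process identifier to verify a cached entry. The names `CacheAddress`, `DataCache` and `lookup` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class CacheAddress:
    context_id: int         # selects the data cache supporting the current user
    process_id: int         # verifies that the cached data belongs to the user
    address: int
    is_write: bool = False  # one example of the "miscellaneous" information


class DataCache:
    def __init__(self) -> None:
        # address -> (process identifier, data); a deliberately simple model
        self._lines: Dict[int, Tuple[int, int]] = {}

    def fill(self, process_id: int, address: int, data: int) -> None:
        self._lines[address] = (process_id, data)

    def read(self, request: CacheAddress) -> Optional[int]:
        entry = self._lines.get(request.address)
        if entry is None or entry[0] != request.process_id:
            return None  # treated as a miss: absent, or owned by another process
        return entry[1]


def lookup(caches: Dict[int, DataCache], request: CacheAddress) -> Optional[int]:
    """Select the cache by context identifier, then verify by process identifier."""
    return caches[request.context_id].read(request)


caches = {0: DataCache(), 1: DataCache()}
caches[1].fill(process_id=7, address=0x100, data=42)
print(lookup(caches, CacheAddress(context_id=1, process_id=7, address=0x100)))
```

In this model a request carrying the wrong process identifier, or directed at another context, is treated as a miss rather than returning another user's data.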

Finally, because there can be several users executing across the pipelines during a single time interval,
information that controls the execution of the instructions, and which would normally be stored within the pipeline, must
be associated with each instruction instead. This information is reflected in the ISW field of an instruction word as
illustrated in Figure 22a. The information in this field is composed of control fields like error masks, floating point format
descriptors, rounding mode descriptors, etc. Each instruction would have this field attached, but, obviously, may not
require all the information. This information is used by the ALU stage 2160 of the processor element.

This instruction information relating to the ISW field, as well as the procedural level, context identification and
process identifier, is attached dynamically by the SCSM attacher (1650) as the instruction is issued from the instruction
cache.
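
The dynamic attachment of user state to an issued instruction can be sketched as follows. This is an illustrative Python model only: the types `UserState` and `Instruction`, the function `attach_state`, and the example ISW contents are assumptions and do not describe the actual word formats of Figure 22.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class UserState:
    procedural_level: int  # absolute procedural level (PL)
    context_id: int        # context identifier (CI)
    process_id: int        # process identifier (PID)
    # ISW control fields, for example error masks or a rounding mode
    isw: Dict[str, object] = field(default_factory=dict)


@dataclass(frozen=True)
class Instruction:
    opcode: str
    state: Optional[UserState] = None  # empty until the instruction is issued


def attach_state(instruction: Instruction, state: UserState) -> Instruction:
    """Return a copy of the instruction carrying the issuing user's state."""
    return replace(instruction, state=state)


user = UserState(procedural_level=2, context_id=1, process_id=7,
                 isw={"rounding": "nearest"})
issued = attach_state(Instruction("MP"), user)
print(issued.state.context_id)
```

Because each instruction carries its own state, instructions from different users can occupy adjacent pipeline stages without the processor element holding any per-user state.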

Claims 


1. A system for executing branches in single entry-single exit (SESE) basic blocks (BBs) contained within a program, 
wherein the system comprises: 

means (620) receptive of the program for determining the branch instruction within each said basic block of the
program, the determining means being further capable of adding instruction firing time information to the
branch instruction,

means (620, 640) operative on the instructions in each said basic block for processing the instructions, and
means (620, 1548) operative on the branch instruction in the basic block for completing the execution of the
branch instruction during the same time as the processing means is processing the last executed non-branch
instruction in the basic block so that the execution of the branch instruction occurs in parallel with the execution
of the instructions in the basic block in order to speed up the overall processing of the program by the system.

2. A system according to claim 1, characterised in that:

each basic block has a plurality of non-branch instructions and ends with a branch instruction,
the firing time information identifies a time of execution of the branch instruction which is a variable number of
instruction cycles prior to a time of execution of a last-to-be-executed instruction of the basic block, and
the means operative on the received branch instruction in the basic block is responsive to the firing time
information for completing the execution of the branch instruction no later than the same time as the processing
means is processing the last-to-be-executed non-branch instruction in the basic block.

3. A system according to claim 1, characterised in that

the branches are scheduled branches, each basic block has a plurality of non-branch instructions and a branch 
instruction, 

the firing time information identifies a time of execution of the branch instruction which is a variable number of 
instruction cycles prior to a time of execution of a last-to-be-executed instruction of the basic block, and 
the means operative on the received branch instruction in the basic block is responsive to the time information, 
for completing the execution of the scheduled branch instruction no later than during the same time as the 
processing means is processing the last-to-be-executed non-branch instructions in the basic block. 

4. A system for executing branches in single entry-single exit (SESE) basic blocks (BBs) in a plurality of programs
utilized by a number of users, wherein the system comprises:

means (160) receptive of each of said programs for determining the branch instruction within each said basic 
block of each of the programs, the determining means being further capable of adding instruction firing time 
information to the branch instructions, 

means (620, 640) operative on the instructions in each said basic block of each said program for processing 
the programs, and 

means (620, 1548) operative on the branch instructions in each said basic block for completing the execution 
of the branch instruction during the same time as the processing means is processing the last executed non- 
branch instruction in the basic block for a given program so that the execution of the branch instruction occurs 
in parallel with the execution of said instructions in the basic block whereby overall processing throughput of all 
the programs by the system is increased. 

5. A system according to claim 4, characterised in that: 
each basic block has a plurality of non-branch instructions and a branch instruction, 

the firing time information identifies a time of execution of the branch instruction which is a variable number of 
instruction cycles prior to a time of execution of a last-to-be-executed instruction of the basic block, and 
the means operative on the received branch instructions in each said basic block is responsive to the firing time 
information for completing the execution of the branch instruction no later than the same time as the processing 
means is processing the last-to-be-executed non-branch instructions in the basic block for a given program. 

6. A system according to claim 4, characterised in that 

the branches are scheduled branches, and each basic block has a plurality of non-branch instructions and a branch instruction,
the firing time information identifies a time of execution of the branch instruction which is a variable number of
instruction cycles prior to a time of execution of a last-to-be-executed instruction of the basic block, and
the means operative on the received branch instructions in each said basic block completes the execution of 
the scheduled branch instruction no later than during the same time as the processing means is processing 
the last-to-be-executed non-branch instruction in the basic block for a given program. 


7. A system for executing branches in single entry-single exit (SESE) basic blocks (BBs) contained within a program,
the basic block having a plurality of non-branch instructions and a branch instruction, wherein the system
comprises:

means receptive of the program for determining the branch instruction within each said basic block of the
program, the determining means further scheduling processing of the branch instruction,
means operative on the received non-branch instructions in each said basic block for processing the said
instructions, and

means operative on the received branch instruction in the basic block for beginning execution of the branch
instruction at one of a variable number of instruction cycles prior to a time of execution of a last-to-be-executed
instruction of the basic block and
for completing the execution of the scheduled branch instruction during an instruction cycle which occurs no
later than during the processing of the last-to-be-executed non-branch instruction in the basic block so that the
execution of the branch instruction occurs in parallel with the execution of the non-branch instructions in the
basic block thereby speeding up the overall processing of the program by the system.

8. A system for executing branches in single entry-single exit (SESE) basic blocks (BBs) in a plurality of programs
utilized by a number of users, the basic block having a plurality of non-branch instructions and a branch instruction,
wherein the system comprises:

means receptive of each of the programs for determining the branch instruction within each said basic block of
each of the programs, the determining means further scheduling processing of the branch instructions,
means operative on the received non-branch instructions in each said basic block of each of the programs for
processing the programs, and

means operative on the received branch instruction in each said basic block for beginning execution of the
branch instruction at one of a variable number of instruction cycles prior to a time of execution of a
last-to-be-executed instruction of the basic block and for completing the execution of the scheduled branch instruction
during an instruction cycle which occurs no later than during the processing of the last-to-be-executed
non-branch instruction in the basic block for a given program so that the execution of the branch instruction occurs
in parallel with the execution of the non-branch instructions in the basic block whereby overall processing
throughput of all the programs by the system is increased.

9. A machine implemented method for operating a programmed computer for executing branches in single
entry-single exit (SESE) basic blocks (BBs) contained within a program, each basic block having a plurality of non-branch
instructions and a branch instruction, wherein the method comprises the steps of:

determining the branch instruction within each of the basic blocks of the program,
adding information to the branch instruction,
processing the instructions in each said basic block,

beginning execution of the branch instruction at one of a variable number of instruction cycles prior to a time of
execution of a last-to-be-executed instruction of the basic block, and

completing the execution of the branch instruction in the basic block, based upon the added information, during
an instruction cycle no later than during the processing of the last-to-be-executed non-branch instruction in the
basic block so that the execution of the branch instruction occurs in parallel with the execution of the
non-branch instructions in the basic block thereby speeding up the overall processing of the program.

10. A method according to claim 9, characterised in that the adding step adds instruction firing time information to the
branch instruction for scheduling the branch instruction, the firing time information identifying a time for beginning
execution of the branch instruction, the method further comprising completing the execution of the scheduled
branch instruction according to the firing time information.

11. A machine implemented method for operating a programmed computer for executing branches in single
entry-single exit (SESE) basic blocks (BBs) contained within a program, each basic block having a plurality of non-branch
instructions and a branch instruction, wherein the method comprises the steps of:

determining the branch instruction within each of the basic blocks of the program,
scheduling processing of the branch instruction,
processing the instructions in each basic block,
beginning execution of the branch instruction at one of a variable number of instruction cycles prior to a time of
execution of a last-to-be-executed instruction of the basic block, and

completing the execution of the scheduled branch instruction during an instruction cycle no later than during
the processing of the last-to-be-executed non-branch instruction in the basic block so that the execution of the
branch instruction occurs in parallel with the execution of the non-branch instructions in the basic block thereby
speeding up the overall processing of the program.

12. A method according to claim 11, characterised by:

adding instruction firing time information to the scheduled branch instruction for scheduling the processing of 
the branch instruction, the firing time information identifying a time of execution of the branch instruction which 
is a variable number of instruction cycles prior to a time of execution of a last-to-be-executed instruction of the 
basic block, and 

completing the execution of the scheduled branch instruction according to the firing time information. 

13. A machine implemented method for operating a programmed computer for executing branches in single
entry-single exit (SESE) basic blocks (BBs) in a plurality of programs utilized by a number of users, each basic block having
a plurality of non-branch instructions and a branch instruction, the method comprising the steps of:

determining the branch instruction within each of the basic blocks of each of the programs,

scheduling processing of the branch instructions,

processing the instructions in each basic block of each program,

beginning execution of the branch instruction at one of a variable number of instruction cycles prior to a time
of execution of a last-to-be-executed instruction of the basic block, and

completing the execution of the scheduled branch instruction during an instruction cycle occurring no later than
during the processing of the last-to-be-executed non-branch instruction in the basic block for a given program
so that the execution of the branch instruction occurs in parallel with the execution of the non-branch
instructions in the basic block whereby overall processing throughput of all the programs is increased.

14. A machine implemented method for operating a programmed computer for executing branches in single
entry-single exit (SESE) basic blocks (BBs) in a plurality of programs used by a number of users, each basic block having a
stream of instructions including a plurality of non-branch instructions and a branch instruction, wherein the method
comprises the steps of:

determining the branch instruction within each said basic block of each of said programs, 

adding information to the branch instructions, 

processing the instructions in each basic block of each program,

beginning execution of the branch instruction at one of a variable number of instruction cycles prior to a time of 
execution of a last-to-be-executed instruction of the basic block, and 

completing the execution of the branch instructions in each said basic block based upon the added information,
during an instruction cycle occurring no later than during the last instruction cycle used for processing the
last-to-be-executed non-branch instruction in each respective basic block for a given program so that the execution
of the branch instruction occurs in parallel with the execution of the non-branch instructions in the basic block
thereby speeding up the overall processing of the programs.

15. An instruction processing apparatus including a system for issuing instructions in a first order and processing the 
instructions in a different order, wherein the system comprises: 

a storage configured to store at least a portion of the instructions to be processed, the stored instructions 
including instructions of a first type and of a second type, instructions of the second type each being associated 
with a delay value; 

an issue circuit coupled to the storage circuit, the issue circuit being configured to issue the stored instructions 
in the said first order; and 

at least one processor element coupled to the issue circuit for receiving the stored instructions in the first order, 
the processor element being configured to process the issued instructions in the said different order, the said 
processing occurring after each said stored instruction of the first type is issued and after a delay time after 
each said stored instruction of the second type is issued, the delay time being determined based on the delay 
value. 

16. A system according to claim 15, characterised in that the first type of instructions consists of non-branch
instructions.

17. A system according to claim 15, characterised in that the second type of instructions consists of branch
instructions.

18. A system according to claim 15, characterised in that the associated delay value specifies a number of processing
cycles.

19. A system according to claim 15, characterised in that the associated delay value specifies a number of instructions.

20. A system according to claim 15, characterised in that the associated delay value is established prior to the issuing
of the stored instructions.

21. A system according to claim 15, characterised in that the associated delay value is specified in each instruction of 
the first type. 

22. A method of issuing a stream of instructions in a first order and processing the instructions in a different order,
wherein the method comprises the steps of:

storing at least a portion of the stream of instructions, including instructions of a first type and of a second type;
associating each instruction of the second type with a delay value;
issuing the stored instructions in the first order;

determining a delay time for each issued instruction of the second type based on the delay value for the
instruction of the second type;

processing each instruction of the first type after it is issued; and

processing each instruction of the second type after the determined delay time after it is issued.

23. A method according to claim 22, characterised in that the instructions of the first type consist of non-branch type 
instructions. 

24. A method according to claim 22, characterised in that the instructions of the second type consist of branch type
instructions.

25. A method according to claim 22, characterised in that the storing step includes storing the delay values associated 
with the instructions of the second type. 

26. A method according to claim 22, characterised in that the step of associating the delay values occurs prior to the
issuing of the instructions of the second type. 

27. A method according to claim 26, characterised in that the associating step is performed automatically so that the 
delay values are determined without human intervention. 


28. A method according to claim 26, characterised in that the associating step is performed in a static manner and the 
determined delay values are specified in the instructions of the second type. 

29. A method according to claim 22, characterised in that the step of associating the delay values occurs prior to the
storing of the instructions of the second type.

30. A method according to claim 22, characterised in that the step of associating the delay values comprises the step 
of specifying a number of processing cycles. 

31. A method according to claim 22, characterised in that the step of associating the delay values comprises the step
of specifying a number of instructions. 

32. Instruction processing apparatus including a system for storing instructions in a first order and selecting the
instructions for processing in a different order, wherein the system comprises:

a storage configured to hold at least a portion of the instructions to be processed, each of the stored
instructions having an associated selection control value,

an instruction selection circuit coupled to the storage, the selection circuit selecting instructions for execution
in the said different order based on the information contained in the associated selection control value of the
said instruction,

at least one processor element coupled to the instruction selection circuit, and configured to process the
selected instructions in the said different selected order.

33. Apparatus according to claim 32, characterised in that the selection control value specifies a number of instruction 
processing cycles. 

34. Apparatus according to claim 32, characterised in that the selection control value specifies a number of
instructions.

35. Apparatus according to claim 32, characterised in that the selection control value is specified in each said
instruction.

36. Apparatus according to claim 32, characterised in that the selection control value is established prior to the storing
of the instructions. 

37. Apparatus according to claim 32, characterised in that the selection control circuit comprises an associative
memory circuit to address the selection control values.


38. A method of storing a stream of instructions in a first order and selecting the instructions for processing in a different 
order, wherein the method comprises the steps of: 

storing at least a portion of the stream of instructions,
associating each said instruction with a selection control value;

selecting the stored instructions in the said different order based on information contained in the said
associated selection control value;

processing each said selected instruction in the said different selected order. 

39. A method according to claim 38, characterised in that the storing step includes storing the selection control values
associated with the instructions. 

40. A method according to claim 38, characterised in that the selection control value occurs prior to said storing of said 
instructions. 


41. A method according to claim 38, characterised in that the associated selection control value is specified in each
said instruction. 

42. A method according to claim 41, characterised in that the associating step is performed automatically so that the
selection control values are determined without human intervention.

43. A method according to claim 41, characterised in that the associating step is performed in a static manner and the
said determined selection control values are specified in the said instructions. 

44. A method according to claim 38, characterised in that the step of associating the selection control value comprises
the step of specifying a number of processing cycles. 

45. A method according to claim 38, characterised in that the step of associating the selection control value comprises 
the step of specifying a number of instructions. 


46. A method according to claim 38, characterised in that the step of selecting the instructions uses associative memory
to perform the said selection.